<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.5">Jekyll</generator><link href="https://shawnh2.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://shawnh2.github.io/" rel="alternate" type="text/html" /><updated>2024-07-12T17:50:24+08:00</updated><id>https://shawnh2.github.io/feed.xml</id><title type="html">Tech the Tempest</title><subtitle>&quot;&quot;
</subtitle><author><name>Your Name</name><email>shawnhxh@outlook.com</email></author><entry><title type="html">Merbridge: Accelerating Istio Traffic Forwarding with eBPF</title><link href="https://shawnh2.github.io/post/2024/06/06/merbridge-dive-in.html" rel="alternate" type="text/html" title="Merbridge: Accelerating Istio Traffic Forwarding with eBPF" /><published>2024-06-06T00:00:00+08:00</published><updated>2024-06-06T00:00:00+08:00</updated><id>https://shawnh2.github.io/post/2024/06/06/merbridge-dive-in</id><content type="html" xml:base="https://shawnh2.github.io/post/2024/06/06/merbridge-dive-in.html"><![CDATA[<blockquote>
  <p>The code discussed in this post is based on Merbridge <a href="https://github.com/merbridge/merbridge/tree/c16cc436ca0a27570be2b42bb3caccced774e614">HEAD c16cc43</a>.</p>
</blockquote>

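<h2 id="简介">Introduction</h2>

<p>Merbridge is an eBPF-based solution for traffic interception and high-performance forwarding in service meshes. It can be adapted to several service mesh projects (Istio, Kuma, Linkerd, and others); this post uses only the Istio sidecar mode as its example.</p>

<p>In Istio sidecar mode, the original traffic path looks like this:</p>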

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2024-06-06/istio-sidecar-traffic.png" alt="istio-sidecar-traffic.png" /></p>

<!--more-->

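<p>Merbridge reduces the number of times application packets pass through the kernel network stack, so that the only network data path left between services is the one between the proxies.</p>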

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2024-06-06/merbridge-traffic.png" alt="merbridge-traffic.png" /></p>

<p>Moreover, if two Pods are on the same Node, the network data path between them can be shortened even further.</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2024-06-06/merbridge-traffic-same-node.png" alt="merbridge-traffic-same-node.png" /></p>

<h2 id="组成">Components</h2>

<p>Merbridge runs in the cluster as a DaemonSet. On startup it:</p>

<ul>
  <li>first loads all eBPF programs</li>
  <li>then starts the Controller</li>
  <li>finally attaches all eBPF programs</li>
</ul>

<h3 id="ebpf-程序清单">eBPF program list</h3>

<p>Merbridge both loads and attaches its eBPF programs by executing <code class="language-plaintext highlighter-rouge">bpftool</code> commands directly, and every eBPF program is pinned under <code class="language-plaintext highlighter-rouge">/sys/fs/bpf</code>.</p>
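<p>As a rough illustration of the kind of <code class="language-plaintext highlighter-rouge">bpftool</code> invocations involved, the sketch below only formats two command strings (<code class="language-plaintext highlighter-rouge">bpftool prog load</code> and <code class="language-plaintext highlighter-rouge">bpftool cgroup attach</code>); it does not execute them. The helper names, the object file name and the cgroup mount point are assumptions made for this example, and the exact arguments Merbridge passes may differ.</p>

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: formats "bpftool prog load OBJ PIN_PATH" into buf.
   Returns the number of characters written, or -1 if buf is too small. */
int format_load_cmd(char *buf, size_t len, const char *obj, const char *pin_path)
{
    int n = snprintf(buf, len, "bpftool prog load %s %s", obj, pin_path);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Hypothetical helper: formats
   "bpftool cgroup attach CGROUP ATTACH_TYPE pinned PIN_PATH" into buf.
   Returns the number of characters written, or -1 if buf is too small. */
int format_cgroup_attach_cmd(char *buf, size_t len, const char *cgroup,
                             const char *attach_type, const char *pin_path)
{
    int n = snprintf(buf, len, "bpftool cgroup attach %s %s pinned %s",
                     cgroup, attach_type, pin_path);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

<p>Keeping the load step and the attach step as separate commands mirrors the startup order described above, where loading happens before the Controller starts and attaching happens after.</p>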

<p>Merbridge works with the following eBPF programs:</p>

<table>
  <thead>
    <tr>
      <th>name</th>
      <th>mount path</th>
      <th>attach to</th>
      <th>attach type</th>
      <th>attach prog</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>connect</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/connect</code></td>
      <td>cgroup2</td>
      <td>connect4/6</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/connect/cgroup_connect</code> 4/6</td>
    </tr>
    <tr>
      <td>sockops</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/sockops</code></td>
      <td>cgroup2</td>
      <td>sock_ops</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/sockops</code></td>
    </tr>
    <tr>
      <td>get_sockopts</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/get_sockopts</code></td>
      <td>cgroup2</td>
      <td>getsockopt</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/get_sockopts</code></td>
    </tr>
    <tr>
      <td>redir</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/redir</code></td>
      <td>prog</td>
      <td>msg_verdict</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/redir</code></td>
    </tr>
    <tr>
      <td>bind</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/bind</code></td>
      <td>cgroup2</td>
      <td>bind4</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/bind</code></td>
    </tr>
    <tr>
      <td>sendmsg</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/sendmsg</code></td>
      <td>cgroup2</td>
      <td>sendmsg4/6</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/sendmsg/cgroup_sendmsg</code> 4/6</td>
    </tr>
    <tr>
      <td>recvmsg</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/recvmsg</code></td>
      <td>cgroup2</td>
      <td>recvmsg4/6</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/recvmsg/cgroup_recvmsg</code> 4/6</td>
    </tr>
    <tr>
      <td>mb_process</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/mb_process</code></td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
  </tbody>
</table>

<p>In addition, Merbridge creates the following bpf maps:</p>

<table>
  <thead>
    <tr>
      <th>name</th>
      <th>mount path</th>
      <th>type</th>
      <th>description</th>
      <th>used by</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>cookie_original_dst</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/cookie_original_dst</code></td>
      <td>lru_hash</td>
      <td>1:1 mapping from a socket cookie to the original destination address of the traffic</td>
      <td>connect｜sockops｜sendmsg｜recvmsg</td>
    </tr>
    <tr>
      <td>local_pod_ips</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/local_pod_ips</code></td>
      <td>hash</td>
      <td>1:1 mapping from a Pod IP to its <code class="language-plaintext highlighter-rouge">podConfig</code></td>
      <td>connect</td>
    </tr>
    <tr>
      <td>process_ip</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/process_ip</code></td>
      <td>lru_hash</td>
      <td>1:1 mapping from a process ID to a Pod IP</td>
      <td>connect｜sockops</td>
    </tr>
    <tr>
      <td>cgroup_info_map</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/cgroup_info_map</code></td>
      <td>lru_hash</td>
      <td>1:1 mapping from a cgroup ID to its cgroup info</td>
      <td>connect｜bind｜sendmsg｜recvmsg</td>
    </tr>
    <tr>
      <td>mark_pod_ips_map</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/mark_pod_ips_map</code></td>
      <td>hash</td>
      <td> </td>
      <td>connect｜sendmsg｜recvmsg</td>
    </tr>
    <tr>
      <td>settings</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/settings</code></td>
      <td>hash</td>
      <td> </td>
      <td>connect｜sockops｜bind</td>
    </tr>
    <tr>
      <td>pair_original_dst</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/pair_original_dst</code></td>
      <td>lru_hash</td>
      <td>1:1 mapping from a four-tuple to the original destination address</td>
      <td>sockops｜get_sockopts</td>
    </tr>
    <tr>
      <td>sock_pair_map</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/sock_pair_map</code></td>
      <td>sockhash</td>
      <td>1:1 mapping between a socket and its four-tuple (keyed by the four-tuple)</td>
      <td>sockops｜redir</td>
    </tr>
    <tr>
      <td>process_events</td>
      <td><code class="language-plaintext highlighter-rouge">/sys/fs/bpf/connect/process_events</code></td>
      <td>perf_event_array</td>
      <td> </td>
      <td>mb_process</td>
    </tr>
  </tbody>
</table>

<h3 id="local-ip-controller">Local IP Controller</h3>

<p>The Controller started by Merbridge is called the Local IP Controller. It is essentially an Informer that watches Pod and Namespace resources.</p>

<p>Because Merbridge runs as a DaemonSet, the instance on each Node only watches <strong>resource changes of the Pods on its own node</strong>. When it observes a change to a Pod managed by Istio (that is, a Pod with an injected sidecar), it updates the <code class="language-plaintext highlighter-rouge">local_pod_ips</code> bpf map, whose key is the Pod IP and whose value is a <code class="language-plaintext highlighter-rouge">podConfig</code> struct:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="n">MaxItemLen</span> <span class="o">=</span> <span class="m">20</span>

<span class="k">type</span> <span class="n">cidr</span> <span class="k">struct</span> <span class="p">{</span>
	<span class="n">net</span>  <span class="kt">uint32</span> <span class="c">// network order</span>
	<span class="n">mask</span> <span class="kt">uint8</span>
	<span class="n">_</span>    <span class="p">[</span><span class="m">3</span><span class="p">]</span><span class="kt">uint8</span> <span class="c">// pad</span>
<span class="p">}</span>

<span class="k">type</span> <span class="n">podConfig</span> <span class="k">struct</span> <span class="p">{</span>
	<span class="n">statusPort</span>       <span class="kt">uint16</span>
	<span class="n">_</span>                <span class="kt">uint16</span> <span class="c">// pad</span>
	<span class="n">excludeOutRanges</span> <span class="p">[</span><span class="n">MaxItemLen</span><span class="p">]</span><span class="n">cidr</span>
	<span class="n">includeOutRanges</span> <span class="p">[</span><span class="n">MaxItemLen</span><span class="p">]</span><span class="n">cidr</span>
	<span class="n">includeInPorts</span>   <span class="p">[</span><span class="n">MaxItemLen</span><span class="p">]</span><span class="kt">uint16</span>
	<span class="n">includeOutPorts</span>  <span class="p">[</span><span class="n">MaxItemLen</span><span class="p">]</span><span class="kt">uint16</span>
	<span class="n">excludeInPorts</span>   <span class="p">[</span><span class="n">MaxItemLen</span><span class="p">]</span><span class="kt">uint16</span>
	<span class="n">excludeOutPorts</span>  <span class="p">[</span><span class="n">MaxItemLen</span><span class="p">]</span><span class="kt">uint16</span>
<span class="p">}</span>
</code></pre></div></div>

<p>These fields carry the same information as the <a href="https://istio.io/latest/docs/reference/config/annotations/#SidecarTrafficExcludeInboundPorts">Istio Resource Annotations</a>. In the Controller implementation, all of them are obtained by parsing the Pod's annotations.</p>
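<p>To make the role of the CIDR ranges more concrete, here is a minimal user-space sketch of how an address could be tested against one <code class="language-plaintext highlighter-rouge">cidr</code> entry. It assumes that <code class="language-plaintext highlighter-rouge">mask</code> holds a prefix length (0 to 32); the function name <code class="language-plaintext highlighter-rouge">cidr_contains</code> is hypothetical and is not taken from the Merbridge source.</p>

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Mirrors the layout described above: a network address in network byte
   order plus a prefix length (assumed meaning of "mask"), padded to 8 bytes. */
struct cidr {
    uint32_t net;    /* network byte order */
    uint8_t  mask;   /* assumed: prefix length, 0..32 */
    uint8_t  pad[3];
};

/* Returns 1 if ip (network byte order) falls inside c, otherwise 0. */
int cidr_contains(const struct cidr *c, uint32_t ip)
{
    if (c->mask > 32)
        return 0;
    /* Shifting a 32-bit value by 32 is undefined in C, so /0 is handled explicitly. */
    uint32_t bits = c->mask == 0 ? 0 : 0xffffffffu << (32 - c->mask);
    return (ntohl(ip) & bits) == (ntohl(c->net) & bits);
}
```

<p>A check of this kind would be applied to each of the up to <code class="language-plaintext highlighter-rouge">MaxItemLen</code> entries in the exclude and include range arrays.</p>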

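<h2 id="工作方式">How it works</h2>

<p>Unless stated otherwise, this part only considers IPv4 networking.</p>

<p>Recall that an Istio sidecar intercepts traffic with iptables: traffic from the application to the outside is intercepted in the iptables OUTPUT chain and forwarded to port 15001 of the sidecar, while traffic from the outside to the application is intercepted in the PREROUTING chain and forwarded to port 15006 of the sidecar.</p>

<p>Istio forwards traffic with the DNAT feature of iptables, whereas Merbridge does it with eBPF. To match what iptables DNAT provides, Merbridge needs to:</p>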

<ul>
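  <li>rewrite the destination address when a connection is initiated, so that the traffic is sent to the new port</li>
  <li>let Envoy recover the original destination address of the traffic</li>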
</ul>

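<h3 id="出口流量处理">Outbound traffic handling</h3>

<p>Using a TCP connection as the example, this section walks through how a connection from the application container (App) to port 15001 of the sidecar Envoy is established.</p>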

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2024-06-06/merbridge-outbound.png" alt="merbridge-outbound.png" /></p>

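<p>Outbound traffic from the application container has to be redirected to port 15001 of the sidecar Envoy (that is, <code class="language-plaintext highlighter-rouge">127.0.0.1:15001</code>).</p>

<p>1. When the application initiates an outbound connection, the <code class="language-plaintext highlighter-rouge">connect</code> eBPF program rewrites the destination address to <code class="language-plaintext highlighter-rouge">127.x.y.z:15001</code> and saves the original destination address of the traffic in the <code class="language-plaintext highlighter-rouge">cookie_original_dst</code> map. The reason for not rewriting the destination to plain <code class="language-plaintext highlighter-rouge">127.0.0.1</code> is to <strong>avoid conflicting four-tuples across different Pods</strong>.</p>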

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">__u32</span> <span class="n">outip</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">int</span> <span class="nf">tcp_connect4</span><span class="p">(</span><span class="k">struct</span> <span class="n">bpf_sock_addr</span> <span class="o">*</span><span class="n">ctx</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// Use the cgroup_info fetched from cgroup_info_map to decide whether this traffic belongs to a Pod in the service mesh</span>
    <span class="c1">// ...</span>
    
    <span class="n">__u32</span> <span class="n">curr_pod_ip</span><span class="p">;</span>
    <span class="n">__u32</span> <span class="n">_curr_pod_ip</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="n">set_ipv6</span><span class="p">(</span><span class="n">_curr_pod_ip</span><span class="p">,</span> <span class="n">cg_info</span><span class="p">.</span><span class="n">cgroup_ip</span><span class="p">);</span>
    <span class="n">curr_pod_ip</span> <span class="o">=</span> <span class="n">get_ipv4</span><span class="p">(</span><span class="n">_curr_pod_ip</span><span class="p">);</span>
    
    <span class="n">__u64</span> <span class="n">uid</span> <span class="o">=</span> <span class="n">bpf_get_current_uid_gid</span><span class="p">()</span> <span class="o">&amp;</span> <span class="mh">0xffffffff</span><span class="p">;</span>
    <span class="n">__u32</span> <span class="n">dst_ip</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">user_ip4</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">uid</span> <span class="o">!=</span> <span class="n">SIDECAR_USER_ID</span><span class="p">)</span> <span class="p">{</span>  <span class="c1">// 1337 is the application UID that Istio reserves for the sidecar</span>
        <span class="c1">// Ignore local traffic whose destination address starts with 127</span>
        <span class="k">if</span> <span class="p">((</span><span class="n">dst_ip</span> <span class="o">&amp;</span> <span class="mh">0xff</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x7f</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
        
        <span class="n">__u64</span> <span class="n">cookie</span> <span class="o">=</span> <span class="n">bpf_get_socket_cookie_addr</span><span class="p">(</span><span class="n">ctx</span><span class="p">);</span>

        <span class="c1">// The traffic is about to be redirected to Envoy; record where it was really headed before the redirect, i.e. the original destination address</span>
        <span class="k">struct</span> <span class="n">origin_info</span> <span class="n">origin</span><span class="p">;</span>
        <span class="n">memset</span><span class="p">(</span><span class="o">&amp;</span><span class="n">origin</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">origin</span><span class="p">));</span>
        <span class="n">set_ipv4</span><span class="p">(</span><span class="n">origin</span><span class="p">.</span><span class="n">ip</span><span class="p">,</span> <span class="n">dst_ip</span><span class="p">);</span>
        <span class="n">origin</span><span class="p">.</span><span class="n">port</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">user_port</span><span class="p">;</span>
        <span class="n">origin</span><span class="p">.</span><span class="n">flags</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">bpf_map_update_elem</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cookie_original_dst</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">cookie</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">origin</span><span class="p">,</span> <span class="n">BPF_ANY</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
        <span class="p">}</span>
        
        <span class="k">if</span> <span class="p">(</span><span class="n">curr_pod_ip</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">struct</span> <span class="n">pod_config</span> <span class="o">*</span><span class="n">pod</span> <span class="o">=</span> <span class="n">bpf_map_lookup_elem</span><span class="p">(</span><span class="o">&amp;</span><span class="n">local_pod_ips</span><span class="p">,</span> <span class="n">_curr_pod_ip</span><span class="p">);</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">pod</span><span class="p">)</span> <span class="p">{</span>
                <span class="cm">/* Decide whether to continue based on the Exclude/Include Out Ports/Ranges settings;
                   the Pod settings in podConfig are collected by the Local IP Controller.
                 */</span>
                <span class="c1">// ...</span>
            <span class="p">}</span>

            <span class="c1">// When a Pod IP is available, bind the socket associated with ctx to the Pod's IP address</span>
            <span class="k">struct</span> <span class="n">sockaddr_in</span> <span class="n">addr</span> <span class="o">=</span> <span class="p">{</span>
                <span class="p">.</span><span class="n">sin_addr</span> <span class="o">=</span>
                    <span class="p">{</span>
                        <span class="p">.</span><span class="n">s_addr</span> <span class="o">=</span> <span class="n">curr_pod_ip</span><span class="p">,</span>
                    <span class="p">},</span>
                <span class="p">.</span><span class="n">sin_port</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>    <span class="c1">// let the kernel pick a random unused port</span>
                <span class="p">.</span><span class="n">sin_family</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span>  <span class="c1">// aka. AF_INET</span>
            <span class="p">};</span>
            <span class="n">bpf_bind</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">addr</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">sockaddr_in</span><span class="p">));</span>
            <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">user_ip4</span> <span class="o">=</span> <span class="n">localhost</span><span class="p">;</span>  <span class="c1">// rewrite the packet's destination address</span>

        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="c1">// When the Pod IP cannot be obtained, use a custom destination address</span>
            <span class="c1">// The reason we try the IP of the 127.128.0.0/20 segment instead of</span>
            <span class="c1">// using 127.0.0.1 directly is to avoid conflicts between the</span>
            <span class="c1">// quaternions of different Pods when the quaternions are</span>
            <span class="c1">// subsequently processed.</span>
            <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">user_ip4</span> <span class="o">=</span> <span class="n">bpf_htonl</span><span class="p">(</span><span class="mh">0x7f800000</span> <span class="o">|</span> <span class="p">(</span><span class="n">outip</span><span class="o">++</span><span class="p">));</span>  <span class="c1">// rewrite the packet's destination address</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">outip</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">outip</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
        
        <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">user_port</span> <span class="o">=</span> <span class="n">bpf_htons</span><span class="p">(</span><span class="n">OUT_REDIRECT_PORT</span><span class="p">);</span>  <span class="c1">// rewrite the packet's destination port to the sidecar's port 15001</span>
    <span class="p">}</span>
    
    <span class="c1">// ...</span>

    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
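<p>The fallback branch above can be tried out in user space. The sketch below repeats the same arithmetic with <code class="language-plaintext highlighter-rouge">htonl</code> in place of <code class="language-plaintext highlighter-rouge">bpf_htonl</code>, to show that consecutive connections receive distinct loopback addresses starting at 127.128.0.1. The function name <code class="language-plaintext highlighter-rouge">next_fallback_dst</code> is made up for this illustration.</p>

```c
#include <stdint.h>
#include <arpa/inet.h>

static uint32_t outip = 1;

/* Returns the next fallback destination address in network byte order,
   using the same arithmetic as the else-branch of tcp_connect4. */
uint32_t next_fallback_dst(void)
{
    uint32_t addr = htonl(0x7f800000u | (outip++));
    if (outip >> 20) {
        /* The counter reached 2^20: wrap around and start again at 1. */
        outip = 1;
    }
    return addr;
}
```

<p>Because every redirected connection gets a different destination address, two connections that happen to share a source port still end up with different four-tuples.</p>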

<p>2. On the application's socket side, when the <code class="language-plaintext highlighter-rouge">sockops</code> eBPF program runs, it stores the current socket and its four-tuple in the <code class="language-plaintext highlighter-rouge">sock_pair_map</code> map, and also writes the four-tuple together with the traffic's original destination address into the <code class="language-plaintext highlighter-rouge">pair_original_dst</code> map.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">int</span> <span class="nf">sockops_ipv4</span><span class="p">(</span><span class="k">struct</span> <span class="n">bpf_sock_ops</span> <span class="o">*</span><span class="n">skops</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">__u64</span> <span class="n">cookie</span> <span class="o">=</span> <span class="n">bpf_get_socket_cookie_ops</span><span class="p">(</span><span class="n">skops</span><span class="p">);</span>

    <span class="k">struct</span> <span class="n">pair</span> <span class="n">p</span><span class="p">;</span>
    <span class="n">set_ipv4</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">sip</span><span class="p">,</span> <span class="n">skops</span><span class="o">-&gt;</span><span class="n">local_ip4</span><span class="p">);</span>
    <span class="n">p</span><span class="p">.</span><span class="n">sport</span> <span class="o">=</span> <span class="n">bpf_htons</span><span class="p">(</span><span class="n">skops</span><span class="o">-&gt;</span><span class="n">local_port</span><span class="p">);</span>
    <span class="n">set_ipv4</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">dip</span><span class="p">,</span> <span class="n">skops</span><span class="o">-&gt;</span><span class="n">remote_ip4</span><span class="p">);</span>  <span class="c1">// on the application-side socket, the destination address and port are already those of Envoy's port 15001</span>
    <span class="n">p</span><span class="p">.</span><span class="n">dport</span> <span class="o">=</span> <span class="n">skops</span><span class="o">-&gt;</span><span class="n">remote_port</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>

    <span class="k">struct</span> <span class="n">origin_info</span> <span class="o">*</span><span class="n">dst</span> <span class="o">=</span>
        <span class="n">bpf_map_lookup_elem</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cookie_original_dst</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">cookie</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">dst</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">origin_info</span> <span class="n">dd</span> <span class="o">=</span> <span class="o">*</span><span class="n">dst</span><span class="p">;</span>
        
        <span class="c1">// ...</span>
	    
        <span class="n">bpf_map_update_elem</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pair_original_dst</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">p</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">dd</span><span class="p">,</span> <span class="n">BPF_ANY</span><span class="p">);</span>
        <span class="n">bpf_sock_hash_update</span><span class="p">(</span><span class="n">skops</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">sock_pair_map</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">p</span><span class="p">,</span> <span class="n">BPF_NOEXIST</span><span class="p">);</span> <span class="c1">// the key is the four-tuple</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">skops</span><span class="o">-&gt;</span><span class="n">local_port</span> <span class="o">==</span> <span class="n">OUT_REDIRECT_PORT</span> <span class="o">||</span>
               <span class="n">skops</span><span class="o">-&gt;</span><span class="n">local_port</span> <span class="o">==</span> <span class="n">IN_REDIRECT_PORT</span> <span class="o">||</span>
               <span class="n">skops</span><span class="o">-&gt;</span><span class="n">remote_ip4</span> <span class="o">==</span> <span class="n">envoy_ip</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// on the Envoy-side socket, likewise write the socket and its four-tuple into the map</span>
        <span class="n">bpf_sock_hash_update</span><span class="p">(</span><span class="n">skops</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">sock_pair_map</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">p</span><span class="p">,</span> <span class="n">BPF_NOEXIST</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="c1">// ...</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

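<p>Note that because this program is attached at the sockops hook, it runs multiple times. Depending on which side it runs on, it handles either the application-side socket or the Envoy-side socket.
When it runs on the sidecar Envoy side, the source address and port of the four-tuple correspond to envoy:15001, while the destination address and port correspond to the application. The Envoy-side socket has no original address entry in the <code class="language-plaintext highlighter-rouge">cookie_original_dst</code> map, so execution falls into the second if branch of the program above: only <code class="language-plaintext highlighter-rouge">sock_pair_map</code> is updated, saving the mapping between the current four-tuple and the Envoy-side socket for later use when forwarding traffic.</p>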
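<p>The branch selection described above can be summarised as a small decision function. This is only a user-space model of the control flow in <code class="language-plaintext highlighter-rouge">sockops_ipv4</code>: the enum, the function name <code class="language-plaintext highlighter-rouge">classify_socket</code> and passing <code class="language-plaintext highlighter-rouge">envoy_ip</code> as a parameter are assumptions for illustration, and the port constants use the values 15001 and 15006 mentioned earlier.</p>

```c
#include <stdbool.h>
#include <stdint.h>

#define OUT_REDIRECT_PORT 15001
#define IN_REDIRECT_PORT  15006

enum sock_side { SIDE_APP, SIDE_ENVOY, SIDE_IGNORED };

/* has_original_dst: whether cookie_original_dst holds an entry for this socket.
   Returns which branch of the sockops program would handle the socket. */
enum sock_side classify_socket(bool has_original_dst, uint32_t local_port,
                               uint32_t remote_ip4, uint32_t envoy_ip)
{
    if (has_original_dst) {
        /* Application side: update both pair_original_dst and sock_pair_map. */
        return SIDE_APP;
    }
    if (local_port == OUT_REDIRECT_PORT || local_port == IN_REDIRECT_PORT ||
        remote_ip4 == envoy_ip) {
        /* Envoy side: update sock_pair_map only. */
        return SIDE_ENVOY;
    }
    return SIDE_IGNORED;
}
```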

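<p>3. After Envoy accepts the application's connection, it queries the destination address of the current connection, which triggers the <code class="language-plaintext highlighter-rouge">get_sockopts</code> eBPF program. The program uses the four-tuple to look up the original destination address in the <code class="language-plaintext highlighter-rouge">pair_original_dst</code> map and writes it into the returned option value. At this point, the outbound connection is fully established.</p>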

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">__section</span><span class="p">(</span><span class="s">"cgroup/getsockopt"</span><span class="p">)</span> <span class="kt">int</span> <span class="nf">mb_get_sockopt</span><span class="p">(</span><span class="k">struct</span> <span class="n">bpf_sockopt</span> <span class="o">*</span><span class="n">ctx</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
	
    <span class="k">struct</span> <span class="n">pair</span> <span class="n">p</span><span class="p">;</span>
    <span class="n">memset</span><span class="p">(</span><span class="o">&amp;</span><span class="n">p</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">p</span><span class="p">));</span>
    <span class="n">p</span><span class="p">.</span><span class="n">dport</span> <span class="o">=</span> <span class="n">bpf_htons</span><span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">sk</span><span class="o">-&gt;</span><span class="n">src_port</span><span class="p">);</span>  <span class="c1">// port 15001 becomes the destination port of the four-tuple; source and destination are swapped so that the four-tuple can be used to look up the original address</span>
    <span class="n">p</span><span class="p">.</span><span class="n">sport</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">sk</span><span class="o">-&gt;</span><span class="n">dst_port</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">origin_info</span> <span class="o">*</span><span class="n">origin</span><span class="p">;</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">sk</span><span class="o">-&gt;</span><span class="n">family</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mi">2</span><span class="p">:</span> <span class="c1">// ipv4</span>
        <span class="n">set_ipv4</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">dip</span><span class="p">,</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">sk</span><span class="o">-&gt;</span><span class="n">src_ip4</span><span class="p">);</span>  <span class="c1">// Envoy's address becomes the destination address of the four-tuple</span>
        <span class="n">set_ipv4</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">sip</span><span class="p">,</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">sk</span><span class="o">-&gt;</span><span class="n">dst_ip4</span><span class="p">);</span>
        <span class="c1">// the four-tuple is ready</span>
        
        <span class="c1">// use the four-tuple to fetch the original destination address saved in the previous step</span>
        <span class="n">origin</span> <span class="o">=</span> <span class="n">bpf_map_lookup_elem</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pair_original_dst</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">p</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">origin</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// overwrite the option value returned for the current socket</span>
            <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">optlen</span> <span class="o">=</span> <span class="p">(</span><span class="n">__s32</span><span class="p">)</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">sockaddr_in</span><span class="p">);</span>
            <span class="k">if</span> <span class="p">((</span><span class="kt">void</span> <span class="o">*</span><span class="p">)((</span><span class="k">struct</span> <span class="n">sockaddr_in</span> <span class="o">*</span><span class="p">)</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">optval</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&gt;</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">optval_end</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
            <span class="p">}</span>
            
            <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">retval</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
            
            <span class="k">struct</span> <span class="n">sockaddr_in</span> <span class="n">sa</span> <span class="o">=</span> <span class="p">{</span>
                <span class="p">.</span><span class="n">sin_family</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">sk</span><span class="o">-&gt;</span><span class="n">family</span><span class="p">,</span>
                <span class="p">.</span><span class="n">sin_addr</span><span class="p">.</span><span class="n">s_addr</span> <span class="o">=</span> <span class="n">get_ipv4</span><span class="p">(</span><span class="n">origin</span><span class="o">-&gt;</span><span class="n">ip</span><span class="p">),</span>
                <span class="p">.</span><span class="n">sin_port</span> <span class="o">=</span> <span class="n">origin</span><span class="o">-&gt;</span><span class="n">port</span><span class="p">,</span>
            <span class="p">};</span>
            <span class="o">*</span><span class="p">(</span><span class="k">struct</span> <span class="n">sockaddr_in</span> <span class="o">*</span><span class="p">)</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">optval</span> <span class="o">=</span> <span class="n">sa</span><span class="p">;</span>  <span class="c1">// write into the buffer of the requested option</span>
        <span class="p">}</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="mi">10</span><span class="p">:</span> <span class="c1">// ipv6</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>4. When data is sent, the <code class="language-plaintext highlighter-rouge">redir</code> eBPF program uses the four-tuple to look up the Sidecar Envoy's socket in <code class="language-plaintext highlighter-rouge">sock_pair_map</code> and forwards the traffic directly with <code class="language-plaintext highlighter-rouge">bpf_msg_redirect_hash</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__section</span><span class="p">(</span><span class="s">"sk_msg"</span><span class="p">)</span> <span class="kt">int</span> <span class="nf">mb_msg_redir</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_msg_md</span> <span class="o">*</span><span class="n">msg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">pair</span> <span class="n">p</span><span class="p">;</span>
    <span class="n">memset</span><span class="p">(</span><span class="o">&amp;</span><span class="n">p</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">p</span><span class="p">));</span>
    <span class="n">p</span><span class="p">.</span><span class="n">dport</span> <span class="o">=</span> <span class="n">bpf_htons</span><span class="p">(</span><span class="n">msg</span><span class="o">-&gt;</span><span class="n">local_port</span><span class="p">);</span>
    <span class="n">p</span><span class="p">.</span><span class="n">sport</span> <span class="o">=</span> <span class="n">msg</span><span class="o">-&gt;</span><span class="n">remote_port</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span> <span class="c1">// the destination port 15001 serves as the source port of the four-tuple, so the lookup finds the socket that matches the tuple</span>

    <span class="k">switch</span> <span class="p">(</span><span class="n">msg</span><span class="o">-&gt;</span><span class="n">family</span><span class="p">)</span> <span class="p">{</span>
<span class="cp">#if ENABLE_IPV4
</span>    <span class="k">case</span> <span class="mi">2</span><span class="p">:</span>
        <span class="c1">// ipv4</span>
        <span class="n">set_ipv4</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">dip</span><span class="p">,</span> <span class="n">msg</span><span class="o">-&gt;</span><span class="n">local_ip4</span><span class="p">);</span>
        <span class="n">set_ipv4</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">sip</span><span class="p">,</span> <span class="n">msg</span><span class="o">-&gt;</span><span class="n">remote_ip4</span><span class="p">);</span>
        <span class="k">break</span><span class="p">;</span>
<span class="cp">#endif
#if ENABLE_IPV6
</span>    <span class="k">case</span> <span class="mi">10</span><span class="p">:</span>
        <span class="c1">// ipv6 ...</span>
<span class="cp">#endif
</span>    <span class="p">}</span>

    <span class="kt">long</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">bpf_msg_redirect_hash</span><span class="p">(</span><span class="n">msg</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">sock_pair_map</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">p</span><span class="p">,</span> <span class="n">BPF_F_INGRESS</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
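
<p>The key construction in <code class="language-plaintext highlighter-rouge">mb_msg_redir</code> can be easier to follow in a user-space sketch. The Go program below is only an illustration, not Merbridge code: it assumes that each socket is stored in the map under its own (local, remote) tuple, a step this section does not show, and the names <code class="language-plaintext highlighter-rouge">sockMap</code>, <code class="language-plaintext highlighter-rouge">register</code>, and <code class="language-plaintext highlighter-rouge">peerOf</code> as well as the addresses are hypothetical.</p>

```go
package main

import "fmt"

// pair mirrors the four-tuple key used with sock_pair_map.
// Plain strings stand in for the kernel-side address arrays (illustration only).
type pair struct {
	sip, dip     string
	sport, dport uint16
}

// sockMap is a user-space stand-in for a BPF sockhash: tuple -> socket name.
type sockMap map[pair]string

// register stores a socket under its own (local -> remote) tuple.
// Assumption: this is how entries reach the map; that step is not shown in this section.
func (m sockMap) register(name, localIP string, localPort uint16, remoteIP string, remotePort uint16) {
	m[pair{sip: localIP, dip: remoteIP, sport: localPort, dport: remotePort}] = name
}

// peerOf builds the key the way mb_msg_redir does: the sender's local
// address becomes the destination and its remote address the source,
// which is the tuple the peer socket was registered under.
func (m sockMap) peerOf(localIP string, localPort uint16, remoteIP string, remotePort uint16) (string, bool) {
	name, ok := m[pair{sip: remoteIP, dip: localIP, sport: remotePort, dport: localPort}]
	return name, ok
}

func main() {
	m := sockMap{}
	// The application connected to a port that was rewritten to 15001.
	m.register("app", "10.0.0.5", 40000, "127.0.0.1", 15001)
	m.register("envoy", "127.0.0.1", 15001, "10.0.0.5", 40000)

	// A message sent by the application resolves to Envoy's socket.
	peer, ok := m.peerOf("10.0.0.5", 40000, "127.0.0.1", 15001)
	fmt.Println(peer, ok) // prints "envoy true"
}
```

<p>Swapping source and destination when building the key is what lets the sender find the socket on the other end of the same connection without any extra bookkeeping.</p>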

<h3 id="入口流量处理">Inbound Traffic Handling</h3>

<p>Inbound traffic is handled much like outbound traffic; the only difference is that the destination port is rewritten to 15006.</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2024-06-06/merbridge-inbound.png" alt="merbridge-inbound.png" /></p>

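<p>Because the eBPF programs take effect globally, Pods that are not managed by Istio would be affected as well, and external traffic would fail to establish connections to them. Merbridge therefore maintains a map named <code class="language-plaintext highlighter-rouge">local_pod_ips</code> (updated by the Local IP Controller). When Merbridge handles inbound traffic and the destination address is not in this map, it leaves the traffic untouched.</p>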

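<p>When external traffic reaches a Pod, it is redirected to Envoy's port 15006 only if the destination Pod is in the <code class="language-plaintext highlighter-rouge">local_pod_ips</code> map maintained on the current Node and is not the Pod currently being processed. The procedure is shown below; it mainly rewrites the destination address of the traffic and records the original address. The rest of the flow is the same as for outbound traffic and is not repeated here.</p>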

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">int</span> <span class="nf">tcp_connect4</span><span class="p">(</span><span class="k">struct</span> <span class="n">bpf_sock_addr</span> <span class="o">*</span><span class="n">ctx</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">uid</span> <span class="o">!=</span> <span class="n">SIDECAR_USER_ID</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// see above</span>
        <span class="c1">// ...</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">__u32</span> <span class="n">_dst_ip</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
        <span class="n">set_ipv4</span><span class="p">(</span><span class="n">_dst_ip</span><span class="p">,</span> <span class="n">dst_ip</span><span class="p">);</span>
        <span class="k">struct</span> <span class="n">pod_config</span> <span class="o">*</span><span class="n">pod</span> <span class="o">=</span> <span class="n">bpf_map_lookup_elem</span><span class="p">(</span><span class="o">&amp;</span><span class="n">local_pod_ips</span><span class="p">,</span> <span class="n">_dst_ip</span><span class="p">);</span>
        <span class="c1">// skip processing if the destination is not a Pod IP on the local Node</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">pod</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="c1">// the destination is local but is not the current Pod</span>
        <span class="c1">// record the original destination so the packet information can be rewritten later</span>
        <span class="k">struct</span> <span class="n">origin_info</span> <span class="n">origin</span><span class="p">;</span>
        <span class="n">memset</span><span class="p">(</span><span class="o">&amp;</span><span class="n">origin</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">origin</span><span class="p">));</span>
        <span class="n">set_ipv4</span><span class="p">(</span><span class="n">origin</span><span class="p">.</span><span class="n">ip</span><span class="p">,</span> <span class="n">dst_ip</span><span class="p">);</span>
        <span class="n">origin</span><span class="p">.</span><span class="n">port</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">user_port</span><span class="p">;</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">curr_pod_ip</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// traffic whose destination is not the current Pod must have its port redirected</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">curr_pod_ip</span> <span class="o">!=</span> <span class="n">dst_ip</span><span class="p">)</span> <span class="p">{</span>
                <span class="cm">/* Decide whether to go any further based on the various Exclude/Include Out Ports settings;
                   the Pod configuration in podConfig is obtained through the Local IP Controller.
                 */</span>
                 <span class="c1">// ...</span>
	            
                <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">user_port</span> <span class="o">=</span> <span class="n">bpf_htons</span><span class="p">(</span><span class="n">IN_REDIRECT_PORT</span><span class="p">);</span>  <span class="c1">// rewrite the destination port to 15006</span>
            <span class="p">}</span>
            <span class="n">origin</span><span class="p">.</span><span class="n">flags</span> <span class="o">|=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="c1">// if the Pod IP could not be obtained, fall back to the legacy way of getting it</span>
            <span class="n">__u32</span> <span class="n">pid</span> <span class="o">=</span> <span class="n">bpf_get_current_pid_tgid</span><span class="p">()</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span> <span class="c1">// tgid</span>
            <span class="kt">void</span> <span class="o">*</span><span class="n">curr_ip</span> <span class="o">=</span> <span class="n">bpf_map_lookup_elem</span><span class="p">(</span><span class="o">&amp;</span><span class="n">process_ip</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">pid</span><span class="p">);</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">curr_ip</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="p">(</span><span class="n">__u32</span> <span class="o">*</span><span class="p">)</span><span class="n">curr_ip</span> <span class="o">!=</span> <span class="n">dst_ip</span><span class="p">)</span> <span class="p">{</span>
                    <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">user_port</span> <span class="o">=</span> <span class="n">bpf_htons</span><span class="p">(</span><span class="n">IN_REDIRECT_PORT</span><span class="p">);</span>  <span class="c1">// rewrite the destination port to 15006</span>
                <span class="p">}</span>
                <span class="n">origin</span><span class="p">.</span><span class="n">flags</span> <span class="o">|=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
                <span class="c1">// the Pod IP still could not be obtained; Envoy may be sending traffic to its own Pod</span>
                <span class="n">origin</span><span class="p">.</span><span class="n">flags</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
                <span class="n">origin</span><span class="p">.</span><span class="n">pid</span> <span class="o">=</span> <span class="n">pid</span><span class="p">;</span>
                <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">user_port</span> <span class="o">=</span> <span class="n">bpf_htons</span><span class="p">(</span><span class="n">IN_REDIRECT_PORT</span><span class="p">);</span>  <span class="c1">// rewrite the destination port to 15006</span>
            <span class="p">}</span>
        <span class="p">}</span>
        
        <span class="n">__u64</span> <span class="n">cookie</span> <span class="o">=</span> <span class="n">bpf_get_socket_cookie_addr</span><span class="p">(</span><span class="n">ctx</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">bpf_map_update_elem</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cookie_original_dst</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">cookie</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">origin</span><span class="p">,</span> <span class="n">BPF_NOEXIST</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
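
<p>The branching in <code class="language-plaintext highlighter-rouge">tcp_connect4</code> boils down to a small decision on the destination port. The Go sketch below restates it under simplifying assumptions: the <code class="language-plaintext highlighter-rouge">process_ip</code> fallback is collapsed into the "current Pod IP unknown" case, port exclusion rules are left out, and the function name <code class="language-plaintext highlighter-rouge">inboundPort</code> and its parameters are hypothetical.</p>

```go
package main

import "fmt"

// inRedirectPort is Envoy's inbound port, as described in the article.
const inRedirectPort = 15006

// inboundPort returns the port a connection issued by the Sidecar should use.
// localPods stands in for the local_pod_ips map, and currPodIP is the IP of
// the Pod issuing the connect ("" when it could not be determined).
func inboundPort(localPods map[string]bool, currPodIP, dstIP string, dstPort uint16) uint16 {
	if !localPods[dstIP] {
		return dstPort // destination is not a managed Pod on this Node: leave it alone
	}
	if currPodIP == "" {
		return inRedirectPort // simplified fallback when the caller's Pod IP is unknown
	}
	if currPodIP != dstIP {
		return inRedirectPort // another local Pod: hand the traffic to its Envoy
	}
	return dstPort // the Sidecar is talking to its own Pod: keep the original port
}

func main() {
	local := map[string]bool{"10.0.0.5": true, "10.0.0.6": true}

	fmt.Println(inboundPort(local, "10.0.0.5", "10.0.0.6", 8080)) // prints 15006
	fmt.Println(inboundPort(local, "10.0.0.5", "10.0.0.5", 8080)) // prints 8080
	fmt.Println(inboundPort(local, "10.0.0.5", "192.168.1.9", 8080)) // prints 8080
}
```

<p>In the eBPF program the original address is also stored in <code class="language-plaintext highlighter-rouge">cookie_original_dst</code>; the sketch leaves that part out because it only illustrates the port decision.</p>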

<h2 id="小结">Summary</h2>

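<p>Merbridge uses eBPF in place of iptables to shorten Istio's traffic path, and it does so without any invasive changes to Istio. After Merbridge is completely uninstalled, Istio simply goes back to using iptables for traffic interception. As far as replacing iptables DNAT with eBPF is concerned, <code class="language-plaintext highlighter-rouge">ORIGINAL_DST</code> is the core concept running through this article: in essence, it records the original destination address of the intercepted traffic.</p>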

<p>Although Merbridge is a small project, it is well worth studying and makes a good starting point for understanding how eBPF works.</p>

<h2 id="reference">Reference</h2>

<ol>
  <li><a href="https://merbridge.io/docs/overview/">https://merbridge.io/docs/overview/</a></li>
  <li><a href="https://arthurchiao.art/blog/bpf-advanced-notes-5-zh/">https://arthurchiao.art/blog/bpf-advanced-notes-5-zh/</a></li>
  <li><a href="https://istio.io/latest/docs/ops/deployment/requirements/#pod-requirements">https://istio.io/latest/docs/ops/deployment/requirements/#pod-requirements</a></li>
  <li><a href="https://github.com/libbpf/bpftool/blob/main/docs/bpftool-cgroup.rst">https://github.com/libbpf/bpftool/blob/main/docs/bpftool-cgroup.rst</a></li>
  <li><a href="https://merbridge.io/blog/2022/03/01/merbridge-introduce/">https://merbridge.io/blog/2022/03/01/merbridge-introduce/</a></li>
</ol>]]></content><author><name>Your Name</name><email>shawnhxh@outlook.com</email></author><category term="post" /><category term="Istio" /><category term="Network" /><category term="eBPF" /><summary type="html"><![CDATA[本文代码基于 Merbridge HEAD c16cc43 展开。 简介 Merbridge 是基于 eBPF 实现的一套可用于服务网格中流量拦截与高性能转发的方案，其支持多种服务网格项目（Istio、Kuma、Linkerd 等）适配，本文只以 Istio Sidecar 模式为例展开。 具体来讲（以 Istio Sidecar 模式为例），下图为原始流量路径：]]></summary></entry><entry><title type="html">2023 年度总结</title><link href="https://shawnh2.github.io/post/2024/02/05/2023-summary.html" rel="alternate" type="text/html" title="2023 年度总结" /><published>2024-02-05T00:00:00+08:00</published><updated>2024-02-05T00:00:00+08:00</updated><id>https://shawnh2.github.io/post/2024/02/05/2023-summary</id><content type="html" xml:base="https://shawnh2.github.io/post/2024/02/05/2023-summary.html"><![CDATA[<p>与其说这是 2023 年的年度总结，不如说这是癸卯年的年度总结，鉴于并不是在公历新年写的。想着既然在 GitHub 开了自己的博客，那就将就着碎碎念一下吧。（点了根烟，开始发挥</p>

<h2 id="博客初衷">Why I Started This Blog</h2>

<p>I started building this blog in May this year. I did not use my own server or register a dedicated domain; to save effort, I simply went with GitHub.io.</p>

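<p>The timing was no accident: I set to work right after hearing the news that 左耳朵耗子 had passed away 🕯️. Life is not eternal, so one ought to leave something behind, and how I understand that "something left behind" is how I understand "eternity".</p>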

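<p>My favorite article of his is <a href="https://coolshell.cn/articles/20276.html">《别让自己“墙”了自己》</a> ("Don't Wall Yourself In"). I like it because it is honest, and because I like it, it stays vivid in my mind. That goes without saying.</p>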

<!--more-->

<h2 id="实习历程">Internships</h2>

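<p>As a graduate student heading into this year's autumn campus recruitment, I did my internships "by the book" (albeit behind my lab advisor's back; I believe he knew perfectly well and simply chose not to say much about it).</p>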

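<p>My first internship, which began in early spring and was the first of my life, was at 七牛云 (Qiniu Cloud) in Shanghai. Lao Xu (Xu Shiwei 许式伟, CEO of Qiniu Cloud; we all respectfully call him Lao Xu) was among the first in China to build a company on Golang, so as a Gopher I came with something of a pilgrim's mindset.</p>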

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2024-02-05/qiniu.jpg" alt="qiniu" /></p>
<center>It even uses Golang's <code class="language-plaintext highlighter-rouge">:=</code> variable declaration syntax</center>

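<p>What I gained most from this internship was a better understanding of how a company and its products think. Since the pay was generous, I stayed longer than planned, four months in the end. I had meant to try for a summer internship, but getting back to Hangzhou was inconvenient, so I talked less with my labmates, and by the time I noticed the problem, most of the window had already passed. In the spirit of "better none than a poor one" and "making the best of a mistake", I ended up heading in the opposite direction from a summer internship.</p>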

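<p>July was when I left Qiniu Cloud. I realized then that ten years had passed since my last visit to Shanghai (I think I had only just started junior high back then), and I vaguely remember that Shanghai Tower in Lujiazui was still being joked about as "the egg beater".</p>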

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2024-02-05/2013-shanghai.jpg" alt="2013 shanghai" /></p>
<center>Lujiazui, Shanghai, in 2013</center>

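<p>My second internship followed immediately at <a href="https://www.greptime.com/">Greptime</a>, a startup building an open-source TSDB, where I worked on the PaaS layer. Because it is in Hangzhou, I could go back to campus almost every weekend, so I naturally had more chances to talk with my labmates.</p>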

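<p>The most fun thing I did during this internship was an open-source collaboration between GreptimeDB and ApeCloud's <a href="https://kubeblocks.io/">KubeBlocks</a>. I <a href="https://shawnh2.github.io/post/2023/08/28/greptimedb-x-kubeblocks.html">wrote up the integration work as a blog post</a> on this site, which was then reposted by the official WeChat accounts of both Greptime and ApeCloud; that was probably my first appearance on a WeChat official account. Sadly, the offline Meetup was held in Beijing, so I could not attend in person.</p>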

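<p>My second article on Greptime's WeChat account was about VPA (adapted from <a href="https://shawnh2.github.io/post/2023/09/30/vpa-in-autoscaler.html">another post of mine</a>); I was researching the topic at the time, so writing it up came naturally. What surprised me most was that the article drew few views on our own account, yet the major K8s accounts reposted it one after another, so our account gained a wave of followers on borrowed arrows, so to speak. One day I also saw <a href="https://twitter.com/xu_paco">@Paco</a> recommend the post, which was very gratifying.</p>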

<center>
<img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2024-02-05/vpa-recommend.jpg" width="50%" height="auto" />
</center>

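<p>Greptime may be a startup, but its strength should not be underestimated. During my time there I was exposed to a great deal of DBaaS-related knowledge; although I did not work on the database kernel, I absorbed a lot from my colleagues' discussions. This second internship lasted until the last day of December and ran in parallel with autumn recruitment. I had planned to stay on as a full-time employee, but for the sake of my own career development I chose to sign with a domestic cloud vendor and move on for now.</p>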

<h2 id="开源起航">Setting Sail in Open Source</h2>

<p>2023 was the year I began contributing to open-source projects.</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2024-02-05/2023-commit.jpg" alt="2023 commits" /></p>

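<p>In January I started out contributing to <a href="https://github.com/apache/dubbo-go-pixiu">Dubbo-go-pixiu</a> under Yu Ge (alias 于雨), improving the gateway's features, and was later "drafted" into the <a href="https://github.com/arana-db">arana community</a>. I moved back and forth between the two projects until June.</p>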

<p>In March I came across two open-source projects, <a href="https://github.com/envoyproxy/gateway">Envoy Gateway</a> and <a href="https://github.com/kubernetes-sigs/gateway-api">Gateway API</a>, and became active in both. In October I was invited to join Envoy and became a reviewer of the Envoy Gateway project.</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2024-02-05/join-envoy.png" alt="join envoy" /></p>

<p>Although the only KubeCon I attended this year was in Shanghai, my GitHub avatar crossed the seas on my behalf to KubeCon Europe and KubeCon North America.</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2024-02-05/kubecon-eu.jpg" alt="kubecon 2023 eu" /></p>
<center>KubeCon Europe 2023</center>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2024-02-05/kubecon-na.jpg" alt="kubecon 2023 na" /></p>
<center>KubeCon North America 2023</center>

<p>There were many other scattered contributions, which I will not go into here.</p>

<center>
<img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2024-02-05/kb-contrib.jpg" width="50%" height="auto" />
</center>

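<p>What does open source mean to me? I have been asking myself that ever since I started contributing this year. I never arrived at a single answer; instead, I have asked myself again at different stages as time went by.</p>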

<ul>
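  <li>In January, it was an opportunity: something that could make me stand out in autumn recruitment;</li>
  <li>In May, it was a task: something to complete in order to enrich my résumé;</li>
  <li>In October, it was a responsibility: I had to fulfil my duties as a reviewer;</li>
  <li>The following January, it was a habit: answering Issues and submitting PRs had become routine.</li>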
</ul>

<h2 id="生活碎片">Fragments of Life</h2>

<p>I started a new relationship this year.</p>

<p>Books I read this year:</p>
<ul>
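  <li><em>The Plague</em> - Albert Camus</li>
  <li><em>Nineteen Eighty-Four</em> - George Orwell</li>
  <li><em>Fortress Besieged</em> - Qian Zhongshu</li>
  <li><em>Border Town</em> - Shen Congwen</li>
  <li><em>Snow Country</em> - Yasunari Kawabata</li>
  <li><em>Rashomon</em> - Ryunosuke Akutagawa</li>
  <li><em>Steppenwolf</em> - Hermann Hesse</li>
  <li>《千里江山图》 - Sun Ganlu</li>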
</ul>

<p>Shows I watched this year:</p>
<ul>
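  <li>US series <em>Mom</em>, seasons 1 to 8, something to watch over meals</li>
  <li>US series <em>Frasier</em>, seasons 1 to 6, something to watch over meals</li>
  <li>Japanese anime <em>Jujutsu Kaisen</em></li>
  <li>Japanese anime <em>Frieren: Beyond Journey's End</em></li>
  <li>Japanese anime <em>Attack on Titan</em>, the finale</li>
  <li>US animation <em>Rick and Morty</em>, season 7</li>
  <li>US animation <em>Solar Opposites</em>, season 4</li>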
</ul>]]></content><author><name>Your Name</name><email>shawnhxh@outlook.com</email></author><category term="post" /><summary type="html"><![CDATA[与其说这是 2023 年的年度总结，不如说这是癸卯年的年度总结，鉴于并不是在公历新年写的。想着既然在 GitHub 开了自己的博客，那就将就着碎碎念一下吧。（点了根烟，开始发挥 博客初衷 从今年五月份的时候开始搭建的这个博客平台，没有用自己服务器，也没有申请专属的域名，而是图着省事直接用 GitHub.io 来的。 当时这个时间节点，是听到左耳朵耗子叔离世🕯️的消息，便开始着手搭建的。想着人活着并非永恒，总得留下点什么东西，而我对于这“留下的东西”的理解，就是对“永恒”的理解。 我最喜欢的耗子叔的一篇文章，就是《别让自己”墙“了自己》。因为真实，所以喜欢；因为喜欢，所以历历在目。不言而喻。]]></summary></entry><entry><title type="html">Autoscaler 中 VPA 的实现原理解析</title><link href="https://shawnh2.github.io/post/2023/09/30/vpa-in-autoscaler.html" rel="alternate" type="text/html" title="Autoscaler 中 VPA 的实现原理解析" /><published>2023-09-30T00:00:00+08:00</published><updated>2023-09-30T00:00:00+08:00</updated><id>https://shawnh2.github.io/post/2023/09/30/vpa-in-autoscaler</id><content type="html" xml:base="https://shawnh2.github.io/post/2023/09/30/vpa-in-autoscaler.html"><![CDATA[<p>Pod 自动垂直伸缩（Vertical Pod Autoscaler，VPA）是 K8s 中集群资源控制的重要一部分。它主要有两个目的：</p>

<ul>
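  <li>Lower the cluster's maintenance cost by configuring the required resources automatically</li>
  <li>Improve cluster resource utilization and reduce the risk of containers running into OOM or CPU starvation</li>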
</ul>

<p>Taking VPA as the entry point, this article analyzes how VPA is implemented in Autoscaler and in Kubernetes In-Place.</p>

<h2 id="autoscaler">Autoscaler</h2>

<blockquote>
  <p>The code discussed in this section is based on Autoscaler HEAD <a href="https://github.com/kubernetes/autoscaler/tree/fbe25e1708cef546e6b114e93b06f03346c39c24">fbe25e1</a>.</p>
</blockquote>

<p>Autoscaler's VPA automatically adjusts the resources a Pod requests based on the Pod's actual usage. It does so by introducing the <a href="https://github.com/kubernetes/autoscaler/blob/fbe25e1708cef546e6b114e93b06f03346c39c24/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L53">VerticalPodAutoscaler</a> API resource, which defines which Pods to match (label selector), which policy to update them with (update policy), and how the resource values are computed (resources policy).</p>

<p>Autoscaler's VPA is implemented by the following modules working together:</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-09-30/overview.png" alt="overview" /></p>

<ul>
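  <li>Recommender: computes the recommended resources for the Pods matched by a VPA object</li>
  <li>Admission Controller: intercepts every Pod creation request and overrides the resource fields of Pods that match a VPA object</li>
  <li>Updater: updates Pod resources in real time</li>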
</ul>

<!--more-->

<h3 id="recommender">Recommender</h3>

<p>Autoscaler's VPA Recommender is deployed as a Deployment. In the spec of the <code class="language-plaintext highlighter-rouge">VerticalPodAutoscaler</code> CRD, the <code class="language-plaintext highlighter-rouge">Recommenders</code> field can name one or more VPA Recommenders (by default, the VPA Recommender named <code class="language-plaintext highlighter-rouge">default</code> is used).</p>

<p>The internal structure of the VPA Recommender is shown below; its components are described after the code:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/recommender/main.go</span>

<span class="n">recommender</span> <span class="o">:=</span> <span class="n">routines</span><span class="o">.</span><span class="n">RecommenderFactory</span><span class="p">{</span>
    <span class="n">ClusterState</span><span class="o">:</span>                 <span class="n">clusterState</span><span class="p">,</span>
    <span class="n">ClusterStateFeeder</span><span class="o">:</span>           <span class="n">clusterStateFeeder</span><span class="p">,</span>
    <span class="n">ControllerFetcher</span><span class="o">:</span>            <span class="n">controllerFetcher</span><span class="p">,</span>
    <span class="n">CheckpointWriter</span><span class="o">:</span>             <span class="n">checkpoint</span><span class="o">.</span><span class="n">NewCheckpointWriter</span><span class="p">(</span><span class="n">clusterState</span><span class="p">,</span> <span class="n">vpa_clientset</span><span class="o">.</span><span class="n">NewForConfigOrDie</span><span class="p">(</span><span class="n">config</span><span class="p">)</span><span class="o">.</span><span class="n">AutoscalingV1</span><span class="p">()),</span>
    <span class="n">VpaClient</span><span class="o">:</span>                    <span class="n">vpa_clientset</span><span class="o">.</span><span class="n">NewForConfigOrDie</span><span class="p">(</span><span class="n">config</span><span class="p">)</span><span class="o">.</span><span class="n">AutoscalingV1</span><span class="p">(),</span>
    <span class="n">PodResourceRecommender</span><span class="o">:</span>       <span class="n">logic</span><span class="o">.</span><span class="n">CreatePodResourceRecommender</span><span class="p">(),</span>
    <span class="n">CheckpointsGCInterval</span><span class="o">:</span>        <span class="o">*</span><span class="n">checkpointsGCInterval</span><span class="p">,</span>  <span class="c">// set by the --checkpoints-gc-interval flag, default 10min</span>
    <span class="n">UseCheckpoints</span><span class="o">:</span>               <span class="n">useCheckpoints</span><span class="p">,</span>
    <span class="c">// ...</span>
<span class="p">}</span><span class="o">.</span><span class="n">Make</span><span class="p">()</span>
</code></pre></div></div>

<ul>
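  <li><code class="language-plaintext highlighter-rouge">ClusterState</code> represents the resource state of the whole cluster, consisting mainly of Pod state and VPA state; it acts as a <strong>local cache</strong></li>
  <li><code class="language-plaintext highlighter-rouge">ClusterStateFeeder</code> defines how the various kinds of cluster resource state are collected; the collected state ends up in <code class="language-plaintext highlighter-rouge">ClusterState</code>. The sources include, but are not limited to: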
    <ul>
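      <li>Pod Lister: an Informer for Pod resources that watches non-<code class="language-plaintext highlighter-rouge">pending</code> Pods in the specified namespace (all namespaces by default)</li>
      <li>VPA Lister: a multi-version client (v1, v1beta1, v1beta2, and so on) defined by Autoscaler; each version's client is essentially a k8s client, and v1 is used by default</li>
      <li>OOM Observer: essentially a channel with a fixed buffer size of 5000 that holds the metadata of OOM Events; it watches all events with <code class="language-plaintext highlighter-rouge">reason=Evicted</code> in the specified namespace (all namespaces by default) and writes them to the channel</li>
      <li>Controller Fetcher: Informers for the various k8s controllers; it watches every controller that can reconcile Pods, including Deployment, DaemonSet, ReplicaSet, Job, and others</li>
      <li>Metrics Client: a client of <a href="https://github.com/kubernetes-sigs/metrics-server">Metrics Server</a> used to fetch the Metrics of Pods in the cluster</li>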
    </ul>
  </li>
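  <li><code class="language-plaintext highlighter-rouge">ControllerFetcher</code> is the Controller Fetcher described above</li>
  <li><code class="language-plaintext highlighter-rouge">VpaClient</code> is defined in the same way as the VPA Lister described above</li>
  <li><code class="language-plaintext highlighter-rouge">PodResourceRecommender</code> is described in the <code class="language-plaintext highlighter-rouge">Estimator</code> section below</li>
  <li>Checkpoints are the persisted historical state of cluster resources, written through the VPA API client as VerticalPodAutoscalerCheckpoint objects rather than to local disk; the VPA Recommender can load this data to compute the recommended Pod resources</li>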
</ul>

<h4 id="estimator">Estimator</h4>

<p>The Estimators that compute the recommended Pod resources are initialized by the <code class="language-plaintext highlighter-rouge">CreatePodResourceRecommender</code> function. It sets up three Estimators, <code class="language-plaintext highlighter-rouge">TargetEstimator</code>, <code class="language-plaintext highlighter-rouge">LowerBoundEstimator</code>, and <code class="language-plaintext highlighter-rouge">UpperBoundEstimator</code>, which represent the target value of the recommended resources and the bounds of the feasible range. There are four kinds of Estimator in total, as shown in the figure below.</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-09-30/estimator.png" alt="estimator" /></p>

<p>Each Estimator is evaluated top-down. Take <code class="language-plaintext highlighter-rouge">PercentileEstimator</code> as an example: it serves as the <code class="language-plaintext highlighter-rouge">baseEstimator</code> of <code class="language-plaintext highlighter-rouge">MarginEstimator</code>, and from a set of states for each container (treated as a distribution) it outputs the CPU and memory values at a given percentile of that distribution (for example, the 95th percentile). The percentiles for CPU and Memory Peaks are constants; only <code class="language-plaintext highlighter-rouge">targetCPUPercentile</code> is configurable, and the rest are fixed.</p>
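
<p>The layering of a percentile estimate under a margin can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not Autoscaler's implementation: it takes the nearest-rank percentile of a plain slice of samples instead of reproducing the Recommender's own aggregation of container state, and the function names, the sample values, and the 15% margin are all hypothetical.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank value at pct (1..100) of samples.
// It plays the role of the base estimator in this sketch.
func percentile(samples []float64, pct int) float64 {
	if len(samples) == 0 {
		return 0
	}
	if pct > 100 {
		pct = 100
	}
	sorted := append([]float64(nil), samples...) // keep the caller's slice untouched
	sort.Float64s(sorted)
	rank := (pct*len(sorted) + 99) / 100 // ceil(pct*n/100) in integer arithmetic
	if rank >= 1 {
		return sorted[rank-1]
	}
	return sorted[0]
}

// withMargin wraps a base estimate with a relative safety margin,
// in the spirit of a margin estimator layered over a base estimator.
func withMargin(base, marginFraction float64) float64 {
	return base * (1 + marginFraction)
}

func main() {
	// Hypothetical CPU usage samples (in cores) for one container.
	cpu := []float64{0.2, 0.4, 0.1, 0.9, 0.3}

	base := percentile(cpu, 90)
	target := withMargin(base, 0.15) // 0.15 is an illustrative margin, not a documented default
	fmt.Printf("base=%.2f target=%.3f\n", base, target)
}
```

<p>Evaluating the base estimator first and then applying the margin mirrors the top-down order described above; swapping in a different percentile yields a lower or upper bound in the same way.</p>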

<h4 id="执行流程">Execution Flow</h4>

<p>The Recommender periodically computes the recommended resource values; the interval is set by the <code class="language-plaintext highlighter-rouge">--recommender-interval</code> flag and defaults to 1 min.</p>

<p>In each run, it first uses <code class="language-plaintext highlighter-rouge">ClusterStateFeeder</code> to load VPAs, Pods, and real-time Metrics into <code class="language-plaintext highlighter-rouge">ClusterState</code>. Loading VPAs, for example, is a <strong>full load</strong>:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/recommender/input/cluster_feeder.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">feeder</span> <span class="o">*</span><span class="n">clusterStateFeeder</span><span class="p">)</span> <span class="n">LoadVPAs</span><span class="p">()</span> <span class="p">{</span>
	<span class="c">// list all VPA API objects</span>
	<span class="n">allVpaCRDs</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">feeder</span><span class="o">.</span><span class="n">vpaLister</span><span class="o">.</span><span class="n">List</span><span class="p">(</span><span class="n">labels</span><span class="o">.</span><span class="n">Everything</span><span class="p">())</span>

	<span class="c">// Filter out VPAs that specified recommenders with names not equal to "default"</span>
	<span class="n">vpaCRDs</span> <span class="o">:=</span> <span class="n">filterVPAs</span><span class="p">(</span><span class="n">feeder</span><span class="p">,</span> <span class="n">allVpaCRDs</span><span class="p">)</span>

	<span class="c">// ... update/add/delete entries in ClusterState.Vpas according to vpaCRDs</span>

	<span class="n">feeder</span><span class="o">.</span><span class="n">clusterState</span><span class="o">.</span><span class="n">ObservedVpas</span> <span class="o">=</span> <span class="n">vpaCRDs</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">filterVPAs</span><span class="p">(</span><span class="n">feeder</span> <span class="o">*</span><span class="n">clusterStateFeeder</span><span class="p">,</span> <span class="n">allVpaCRDs</span> <span class="p">[]</span><span class="o">*</span><span class="n">vpa_types</span><span class="o">.</span><span class="n">VerticalPodAutoscaler</span><span class="p">)</span> <span class="p">[]</span><span class="o">*</span><span class="n">vpa_types</span><span class="o">.</span><span class="n">VerticalPodAutoscaler</span> <span class="p">{</span>
	<span class="k">var</span> <span class="n">vpaCRDs</span> <span class="p">[]</span><span class="o">*</span><span class="n">vpa_types</span><span class="o">.</span><span class="n">VerticalPodAutoscaler</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">vpaCRD</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">allVpaCRDs</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">feeder</span><span class="o">.</span><span class="n">recommenderName</span> <span class="o">==</span> <span class="n">DefaultRecommenderName</span> <span class="p">{</span>  <span class="c">// if this Recommender is named default,</span>
			<span class="c">// skip VPAs that name other Recommenders and do not include the default one;</span>
			<span class="c">// a VPA that names no Recommender at all uses the default Recommender</span>
			<span class="k">if</span> <span class="o">!</span><span class="n">implicitDefaultRecommender</span><span class="p">(</span><span class="n">vpaCRD</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">Recommenders</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">selectsRecommender</span><span class="p">(</span><span class="n">vpaCRD</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">Recommenders</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">feeder</span><span class="o">.</span><span class="n">recommenderName</span><span class="p">)</span> <span class="p">{</span>
				<span class="k">continue</span>
			<span class="p">}</span>
		<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
			<span class="c">// a Recommender with any other name must not act as the default for VPAs that specify no Recommenders</span>
			<span class="k">if</span> <span class="n">implicitDefaultRecommender</span><span class="p">(</span><span class="n">vpaCRD</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">Recommenders</span><span class="p">)</span> <span class="p">{</span>
				<span class="k">continue</span>
			<span class="p">}</span>
			<span class="c">// the VPA takes effect only if this Recommender matches one of the VPA's Recommenders</span>
			<span class="k">if</span> <span class="o">!</span><span class="n">selectsRecommender</span><span class="p">(</span><span class="n">vpaCRD</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">Recommenders</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">feeder</span><span class="o">.</span><span class="n">recommenderName</span><span class="p">)</span> <span class="p">{</span>
				<span class="k">continue</span>
			<span class="p">}</span>
		<span class="p">}</span>
		<span class="n">vpaCRDs</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">vpaCRDs</span><span class="p">,</span> <span class="n">vpaCRD</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">vpaCRDs</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Finally, the Recommender calls the <code class="language-plaintext highlighter-rouge">UpdateVPAs</code> method to compute the recommended Pod resources and write them to the VPA objects.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/recommender/routines/recommender.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">recommender</span><span class="p">)</span> <span class="n">UpdateVPAs</span><span class="p">()</span> <span class="p">{</span>
	<span class="c">// ...</span>

	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">observedVpa</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">r</span><span class="o">.</span><span class="n">clusterState</span><span class="o">.</span><span class="n">ObservedVpas</span> <span class="p">{</span>  <span class="c">// populated by LoadVPAs()</span>
		<span class="n">key</span> <span class="o">:=</span> <span class="n">model</span><span class="o">.</span><span class="n">VpaID</span><span class="p">{</span>
			<span class="n">Namespace</span><span class="o">:</span> <span class="n">observedVpa</span><span class="o">.</span><span class="n">Namespace</span><span class="p">,</span>
			<span class="n">VpaName</span><span class="o">:</span>   <span class="n">observedVpa</span><span class="o">.</span><span class="n">Name</span><span class="p">,</span>
		<span class="p">}</span>

		<span class="n">vpa</span><span class="p">,</span> <span class="n">found</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">clusterState</span><span class="o">.</span><span class="n">Vpas</span><span class="p">[</span><span class="n">key</span><span class="p">]</span>
		<span class="k">if</span> <span class="o">!</span><span class="n">found</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="n">resources</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">podResourceRecommender</span><span class="o">.</span><span class="n">GetRecommendedPodResources</span><span class="p">(</span><span class="n">GetContainerNameToAggregateStateMap</span><span class="p">(</span><span class="n">vpa</span><span class="p">))</span>  <span class="c">// compute the recommended resources via the Estimator</span>

		<span class="n">listOfResourceRecommendation</span> <span class="o">:=</span> <span class="n">logic</span><span class="o">.</span><span class="n">MapToListOfRecommendedContainerResources</span><span class="p">(</span><span class="n">resources</span><span class="p">)</span>
		<span class="n">vpa</span><span class="o">.</span><span class="n">UpdateRecommendation</span><span class="p">(</span><span class="n">listOfResourceRecommendation</span><span class="p">)</span>  <span class="c">// write the recommendation into the VPA</span>

		<span class="n">hasMatchingPods</span> <span class="o">:=</span> <span class="n">vpa</span><span class="o">.</span><span class="n">PodCount</span> <span class="o">&gt;</span> <span class="m">0</span>
		<span class="n">vpa</span><span class="o">.</span><span class="n">UpdateConditions</span><span class="p">(</span><span class="n">hasMatchingPods</span><span class="p">)</span>  <span class="c">// update the VPA conditions</span>

		<span class="n">err</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">clusterState</span><span class="o">.</span><span class="n">RecordRecommendation</span><span class="p">(</span><span class="n">vpa</span><span class="p">,</span> <span class="n">time</span><span class="o">.</span><span class="n">Now</span><span class="p">())</span>  <span class="c">// also record the recommendation in ClusterState</span>

		<span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">vpa_utils</span><span class="o">.</span><span class="n">UpdateVpaStatusIfNeeded</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">vpaClient</span><span class="o">.</span><span class="n">VerticalPodAutoscalers</span><span class="p">(</span><span class="n">vpa</span><span class="o">.</span><span class="n">ID</span><span class="o">.</span><span class="n">Namespace</span><span class="p">),</span> <span class="n">vpa</span><span class="o">.</span><span class="n">ID</span><span class="o">.</span><span class="n">VpaName</span><span class="p">,</span>
                                                    <span class="n">vpa</span><span class="o">.</span><span class="n">AsStatus</span><span class="p">()</span> <span class="c">/* new status */</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">observedVpa</span><span class="o">.</span><span class="n">Status</span> <span class="c">/* old status */</span><span class="p">)</span>  <span class="c">// update the VPA Status</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="admission-controller">Admission Controller</h3>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-09-30/admission.png" alt="admission" /></p>

<p>The VPA Admission Controller in Autoscaler is deployed as a Deployment and, by default, serves HTTPS through a Service named <code class="language-plaintext highlighter-rouge">vpa-webhook</code> in the <code class="language-plaintext highlighter-rouge">kube-system</code> namespace.
The code below shows its overall flow (the figure above gives a rough overview):</p>

<ul>
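  <li>Create and start the Admission Server</li>
  <li>Register the Handlers for Pod and VPA resources, which <strong>process the creation requests for their respective resources</strong></li>
  <li>Register the Calculators that obtain the resource recommendations computed by the Recommender. Two Calculators are registered: the first reads the recommendation from the Recommend field of the VPA CRD, and the second adds an annotation of the form <code class="language-plaintext highlighter-rouge">vpaObservedContainers: {container_name1, ...}</code> to every Pod</li>
  <li>Register the Webhook that intercepts creation requests for the relevant resources, as described in detail below</li>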
</ul>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/admission-controller/main.go</span>

<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
	<span class="c">// ...</span>

	<span class="n">vpaClient</span> <span class="o">:=</span> <span class="n">vpa_clientset</span><span class="o">.</span><span class="n">NewForConfigOrDie</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
	<span class="n">vpaLister</span> <span class="o">:=</span> <span class="n">vpa_api_util</span><span class="o">.</span><span class="n">NewVpasLister</span><span class="p">(</span><span class="n">vpaClient</span><span class="p">,</span> <span class="nb">make</span><span class="p">(</span><span class="k">chan</span> <span class="k">struct</span><span class="p">{}),</span> <span class="o">*</span><span class="n">vpaObjectNamespace</span><span class="p">)</span>  <span class="c">// same VPA Lister as described earlier</span>

	<span class="n">kubeClient</span> <span class="o">:=</span> <span class="n">kube_client</span><span class="o">.</span><span class="n">NewForConfigOrDie</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
	<span class="n">factory</span> <span class="o">:=</span> <span class="n">informers</span><span class="o">.</span><span class="n">NewSharedInformerFactory</span><span class="p">(</span><span class="n">kubeClient</span><span class="p">,</span> <span class="c">/* defaultResyncPeriod=10min */</span><span class="p">)</span>
	<span class="n">targetSelectorFetcher</span> <span class="o">:=</span> <span class="n">target</span><span class="o">.</span><span class="n">NewVpaTargetSelectorFetcher</span><span class="p">(</span><span class="n">config</span><span class="p">,</span> <span class="n">kubeClient</span><span class="p">,</span> <span class="n">factory</span><span class="p">)</span>  <span class="c">// same as the Controller Fetcher described earlier</span>

	<span class="n">recommendationProvider</span> <span class="o">:=</span> <span class="n">recommendation</span><span class="o">.</span><span class="n">NewProvider</span><span class="p">(</span><span class="c">/* ... */</span><span class="p">)</span>  <span class="c">// provider of the recommended resource values</span>
	<span class="n">vpaMatcher</span> <span class="o">:=</span> <span class="n">vpa</span><span class="o">.</span><span class="n">NewMatcher</span><span class="p">(</span><span class="n">vpaLister</span><span class="p">,</span> <span class="n">targetSelectorFetcher</span><span class="p">)</span>

	<span class="n">calculators</span> <span class="o">:=</span> <span class="p">[]</span><span class="n">patch</span><span class="o">.</span><span class="n">Calculator</span><span class="p">{</span><span class="n">patch</span><span class="o">.</span><span class="n">NewResourceUpdatesCalculator</span><span class="p">(</span><span class="n">recommendationProvider</span><span class="p">),</span> <span class="n">patch</span><span class="o">.</span><span class="n">NewObservedContainersCalculator</span><span class="p">()}</span>
	<span class="n">as</span> <span class="o">:=</span> <span class="n">logic</span><span class="o">.</span><span class="n">NewAdmissionServer</span><span class="p">(</span><span class="c">/* ... */</span><span class="p">,</span> <span class="n">vpaMatcher</span><span class="p">,</span> <span class="n">calculators</span><span class="p">)</span>  <span class="c">// create the Server</span>
                     <span class="err">\</span>
                      <span class="err">\</span>
                       <span class="n">as</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">AdmissionServer</span><span class="p">{</span><span class="c">/* ... */</span><span class="p">,</span> <span class="k">map</span><span class="p">[</span><span class="n">metav1</span><span class="o">.</span><span class="n">GroupResource</span><span class="p">]</span><span class="n">resource</span><span class="o">.</span><span class="n">Handler</span><span class="p">{}}</span>
                       <span class="n">as</span><span class="o">.</span><span class="n">RegisterResourceHandler</span><span class="p">(</span><span class="n">pod</span><span class="o">.</span><span class="n">NewResourceHandler</span><span class="p">(</span><span class="c">/* ... */</span><span class="p">,</span> <span class="n">vpaMatcher</span><span class="p">,</span> <span class="n">calculators</span><span class="p">))</span>  <span class="c">// register the Resource Handlers</span>
                       <span class="n">as</span><span class="o">.</span><span class="n">RegisterResourceHandler</span><span class="p">(</span><span class="n">vpa</span><span class="o">.</span><span class="n">NewResourceHandler</span><span class="p">(</span><span class="c">/* ... */</span><span class="p">))</span>

	<span class="n">http</span><span class="o">.</span><span class="n">HandleFunc</span><span class="p">(</span><span class="s">"/"</span><span class="p">,</span> <span class="k">func</span><span class="p">(</span><span class="n">w</span> <span class="n">http</span><span class="o">.</span><span class="n">ResponseWriter</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Request</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">as</span><span class="o">.</span><span class="n">Serve</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span>  <span class="c">// handle the intercepted requests</span>
	<span class="p">})</span>
	<span class="n">server</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">http</span><span class="o">.</span><span class="n">Server</span><span class="p">{</span>
		<span class="n">Addr</span><span class="o">:</span>      <span class="n">fmt</span><span class="o">.</span><span class="n">Sprintf</span><span class="p">(</span><span class="s">":%d"</span><span class="p">,</span> <span class="o">*</span><span class="n">port</span><span class="p">),</span>
		<span class="n">TLSConfig</span><span class="o">:</span> <span class="n">configTLS</span><span class="p">(</span><span class="n">certs</span><span class="o">.</span><span class="n">serverCert</span><span class="p">,</span> <span class="n">certs</span><span class="o">.</span><span class="n">serverKey</span><span class="p">),</span>
	<span class="p">}</span>

	<span class="k">go</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
		<span class="n">selfRegistration</span><span class="p">(</span><span class="n">kubeClient</span><span class="p">,</span> <span class="n">certs</span><span class="o">.</span><span class="n">caCert</span><span class="p">,</span> <span class="n">namespace</span><span class="p">,</span> <span class="o">*</span><span class="n">serviceName</span><span class="p">,</span> <span class="n">url</span><span class="p">,</span> <span class="o">*</span><span class="n">registerByURL</span><span class="p">,</span> <span class="kt">int32</span><span class="p">(</span><span class="o">*</span><span class="n">webhookTimeout</span><span class="p">))</span>  <span class="c">// register itself as a MutatingAdmissionWebhook</span>
	<span class="p">}()</span>

	<span class="n">err</span> <span class="o">=</span> <span class="n">server</span><span class="o">.</span><span class="n">ListenAndServeTLS</span><span class="p">(</span><span class="s">""</span><span class="p">,</span> <span class="s">""</span><span class="p">)</span>  <span class="c">// start the HTTPS server</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="webhook-注册">Webhook 注册</h4>

<p>The Admission Controller uses the <code class="language-plaintext highlighter-rouge">selfRegistration</code> function to register the service it provides as a <code class="language-plaintext highlighter-rouge">MutatingAdmissionWebhook</code>. The webhook configuration shows that it fires only on CREATE events for Pods and on CREATE or UPDATE events for VPAs.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/admission-controller/config.go</span>

<span class="k">func</span> <span class="n">selfRegistration</span><span class="p">(</span><span class="n">clientset</span> <span class="o">*</span><span class="n">kubernetes</span><span class="o">.</span><span class="n">Clientset</span><span class="p">,</span> <span class="n">caCert</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">,</span> <span class="n">namespace</span><span class="p">,</span> <span class="n">serviceName</span><span class="p">,</span> <span class="n">url</span> <span class="kt">string</span><span class="p">,</span> <span class="n">registerByURL</span> <span class="kt">bool</span><span class="p">,</span> <span class="n">timeoutSeconds</span> <span class="kt">int32</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">time</span><span class="o">.</span><span class="n">Sleep</span><span class="p">(</span><span class="m">10</span> <span class="o">*</span> <span class="n">time</span><span class="o">.</span><span class="n">Second</span><span class="p">)</span>  <span class="c">// wait a while before starting</span>
	<span class="n">client</span> <span class="o">:=</span> <span class="n">clientset</span><span class="o">.</span><span class="n">AdmissionregistrationV1</span><span class="p">()</span><span class="o">.</span><span class="n">MutatingWebhookConfigurations</span><span class="p">()</span>
	<span class="c">// an existing webhook configuration is deleted and then recreated</span>
	<span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">client</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">TODO</span><span class="p">(),</span> <span class="s">"vpa-webhook-config"</span><span class="p">,</span> <span class="n">metav1</span><span class="o">.</span><span class="n">GetOptions</span><span class="p">{})</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="n">err2</span> <span class="o">:=</span> <span class="n">client</span><span class="o">.</span><span class="n">Delete</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">TODO</span><span class="p">(),</span> <span class="s">"vpa-webhook-config"</span><span class="p">,</span> <span class="n">metav1</span><span class="o">.</span><span class="n">DeleteOptions</span><span class="p">{})</span>
	<span class="p">}</span>

	<span class="n">RegisterClientConfig</span> <span class="o">:=</span> <span class="n">admissionregistration</span><span class="o">.</span><span class="n">WebhookClientConfig</span><span class="p">{}</span>
	<span class="n">RegisterClientConfig</span><span class="o">.</span><span class="n">Service</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">admissionregistration</span><span class="o">.</span><span class="n">ServiceReference</span><span class="p">{</span>  <span class="c">// Service used to establish the TLS connection to the webhook</span>
		<span class="n">Namespace</span><span class="o">:</span> <span class="n">namespace</span><span class="p">,</span>
		<span class="n">Name</span><span class="o">:</span>      <span class="n">serviceName</span><span class="p">,</span>
	<span class="p">}</span>

	<span class="n">RegisterClientConfig</span><span class="o">.</span><span class="n">CABundle</span> <span class="o">=</span> <span class="n">caCert</span>
	<span class="n">webhookConfig</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">admissionregistration</span><span class="o">.</span><span class="n">MutatingWebhookConfiguration</span><span class="p">{</span>
		<span class="n">ObjectMeta</span><span class="o">:</span> <span class="n">metav1</span><span class="o">.</span><span class="n">ObjectMeta</span><span class="p">{</span>
			<span class="n">Name</span><span class="o">:</span> <span class="s">"vpa-webhook-config"</span><span class="p">,</span>
		<span class="p">},</span>
		<span class="n">Webhooks</span><span class="o">:</span> <span class="p">[]</span><span class="n">admissionregistration</span><span class="o">.</span><span class="n">MutatingWebhook</span><span class="p">{</span>
			<span class="p">{</span>
				<span class="n">Name</span><span class="o">:</span>                    <span class="s">"vpa.k8s.io"</span><span class="p">,</span>
				<span class="n">AdmissionReviewVersions</span><span class="o">:</span> <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">"v1"</span><span class="p">},</span>
				<span class="n">Rules</span><span class="o">:</span> <span class="p">[]</span><span class="n">admissionregistration</span><span class="o">.</span><span class="n">RuleWithOperations</span><span class="p">{</span>
					<span class="p">{</span>
						<span class="n">Operations</span><span class="o">:</span> <span class="p">[]</span><span class="n">admissionregistration</span><span class="o">.</span><span class="n">OperationType</span><span class="p">{</span><span class="s">"CREATE"</span><span class="p">},</span>
						<span class="n">Rule</span><span class="o">:</span> <span class="n">admissionregistration</span><span class="o">.</span><span class="n">Rule</span><span class="p">{</span>
							<span class="n">APIGroups</span><span class="o">:</span>   <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">""</span><span class="p">},</span>
							<span class="n">APIVersions</span><span class="o">:</span> <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">"v1"</span><span class="p">},</span>
							<span class="n">Resources</span><span class="o">:</span>   <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">"pods"</span><span class="p">},</span>
						<span class="p">},</span>
					<span class="p">},</span>
					<span class="p">{</span>
						<span class="n">Operations</span><span class="o">:</span> <span class="p">[]</span><span class="n">admissionregistration</span><span class="o">.</span><span class="n">OperationType</span><span class="p">{</span><span class="s">"CREATE"</span><span class="p">,</span> <span class="s">"UPDATE"</span><span class="p">},</span>
						<span class="n">Rule</span><span class="o">:</span> <span class="n">admissionregistration</span><span class="o">.</span><span class="n">Rule</span><span class="p">{</span>
							<span class="n">APIGroups</span><span class="o">:</span>   <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">"autoscaling.k8s.io"</span><span class="p">},</span>
							<span class="n">APIVersions</span><span class="o">:</span> <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">"*"</span><span class="p">},</span>
							<span class="n">Resources</span><span class="o">:</span>   <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">"verticalpodautoscalers"</span><span class="p">},</span>
						<span class="p">},</span>
					<span class="p">},</span>
				<span class="p">},</span>
				<span class="n">FailurePolicy</span><span class="o">:</span>  <span class="s">"Ignore"</span><span class="p">,</span>
				<span class="n">ClientConfig</span><span class="o">:</span>   <span class="n">RegisterClientConfig</span><span class="p">,</span>
				<span class="n">SideEffects</span><span class="o">:</span>    <span class="s">"None"</span><span class="p">,</span>
				<span class="n">TimeoutSeconds</span><span class="o">:</span> <span class="o">&amp;</span><span class="n">timeoutSeconds</span><span class="p">,</span>
			<span class="p">},</span>
		<span class="p">},</span>
	<span class="p">}</span>
	<span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">Create</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">TODO</span><span class="p">(),</span> <span class="n">webhookConfig</span><span class="p">,</span> <span class="n">metav1</span><span class="o">.</span><span class="n">CreateOptions</span><span class="p">{})</span>
<span class="p">}</span>
</code></pre></div></div>
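
<p>To make the effect of the two rules concrete, the sketch below checks which requests would be sent to the webhook. The <code class="language-plaintext highlighter-rouge">rule</code> type and the <code class="language-plaintext highlighter-rouge">intercepted</code> function are hypothetical simplifications that compare only API group, resource and operation. API versions, wildcards and scopes, which the real API server also evaluates, are deliberately ignored.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import "fmt"

// rule is a simplified, hypothetical stand-in for RuleWithOperations.
type rule struct {
	operations []string
	apiGroup   string
	resource   string
}

// vpaWebhookRules mirrors the two rules registered by selfRegistration.
var vpaWebhookRules = []rule{
	{operations: []string{"CREATE"}, apiGroup: "", resource: "pods"},
	{operations: []string{"CREATE", "UPDATE"}, apiGroup: "autoscaling.k8s.io", resource: "verticalpodautoscalers"},
}

// intercepted reports whether a request with the given group, resource and
// operation matches one of the rules and would be sent to the webhook.
func intercepted(apiGroup, resource, operation string) bool {
	for _, r := range vpaWebhookRules {
		if r.apiGroup != apiGroup || r.resource != resource {
			continue
		}
		for _, op := range r.operations {
			if op == operation {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(intercepted("", "pods", "CREATE"))                                     // true
	fmt.Println(intercepted("", "pods", "UPDATE"))                                     // false
	fmt.Println(intercepted("autoscaling.k8s.io", "verticalpodautoscalers", "UPDATE")) // true
}
</code></pre></div></div>

<p>Since Pod updates never reach the webhook, new recommendations are applied only to Pods as they are created.</p>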

<h4 id="admit">Admit</h4>

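<p>Pod creation requests (and VPA creation/update requests) are intercepted by the <code class="language-plaintext highlighter-rouge">MutatingAdmissionWebhook</code> above and forwarded to the service exposed by the Admission Controller Server, which handles each incoming creation (or update) request in its <code class="language-plaintext highlighter-rouge">Serve</code> method.</p>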

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/admission-controller/logic/server.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">AdmissionServer</span><span class="p">)</span> <span class="n">Serve</span><span class="p">(</span><span class="n">w</span> <span class="n">http</span><span class="o">.</span><span class="n">ResponseWriter</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Request</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">var</span> <span class="n">body</span> <span class="p">[]</span><span class="kt">byte</span>  <span class="c">// read the request data</span>
	<span class="k">if</span> <span class="n">r</span><span class="o">.</span><span class="n">Body</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">data</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">io</span><span class="o">.</span><span class="n">ReadAll</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">Body</span><span class="p">);</span> <span class="n">err</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">body</span> <span class="o">=</span> <span class="n">data</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="n">contentType</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">Header</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="s">"Content-Type"</span><span class="p">)</span>  <span class="c">// the request body must be JSON</span>
	<span class="k">if</span> <span class="n">contentType</span> <span class="o">!=</span> <span class="s">"application/json"</span> <span class="p">{</span>
		<span class="k">return</span>
	<span class="p">}</span>

	<span class="n">reviewResponse</span><span class="p">,</span> <span class="n">status</span><span class="p">,</span> <span class="n">resource</span> <span class="o">:=</span> <span class="n">s</span><span class="o">.</span><span class="n">admit</span><span class="p">(</span><span class="n">body</span><span class="p">)</span>  <span class="c">// build the response content</span>
	<span class="n">ar</span> <span class="o">:=</span> <span class="n">v1</span><span class="o">.</span><span class="n">AdmissionReview</span><span class="p">{</span>
		<span class="n">Response</span><span class="o">:</span> <span class="n">reviewResponse</span><span class="p">,</span>
		<span class="n">TypeMeta</span><span class="o">:</span> <span class="n">metav1</span><span class="o">.</span><span class="n">TypeMeta</span><span class="p">{</span>
			<span class="n">Kind</span><span class="o">:</span>       <span class="s">"AdmissionReview"</span><span class="p">,</span>
			<span class="n">APIVersion</span><span class="o">:</span> <span class="s">"admission.k8s.io/v1"</span><span class="p">,</span>
		<span class="p">},</span>
	<span class="p">}</span>
	<span class="n">resp</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">Marshal</span><span class="p">(</span><span class="n">ar</span><span class="p">)</span>
	<span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">w</span><span class="o">.</span><span class="n">Write</span><span class="p">(</span><span class="n">resp</span><span class="p">)</span>  <span class="c">// write the response back</span>
<span class="p">}</span>
</code></pre></div></div>
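
<p>The request/response shape of such a handler can be tried out in memory with <code class="language-plaintext highlighter-rouge">net/http/httptest</code>. The following is a minimal sketch, not the VPA implementation: <code class="language-plaintext highlighter-rouge">serve</code> and <code class="language-plaintext highlighter-rouge">call</code> are hypothetical helpers, and the fixed <code class="language-plaintext highlighter-rouge">allowed: true</code> response stands in for whatever <code class="language-plaintext highlighter-rouge">admit</code> would return.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// serve is a hypothetical, trimmed-down handler: it ignores requests that are
// not JSON and answers JSON requests with a fixed AdmissionReview.
func serve(w http.ResponseWriter, r *http.Request) {
	if r.Header.Get("Content-Type") != "application/json" {
		return // nothing is written, as in the early return of Serve
	}
	review := map[string]interface{}{
		"apiVersion": "admission.k8s.io/v1",
		"kind":       "AdmissionReview",
		"response":   map[string]interface{}{"allowed": true},
	}
	resp, err := json.Marshal(review)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	if _, err := w.Write(resp); err != nil {
		fmt.Println("write failed:", err)
	}
}

// call sends one in-memory POST request to serve and returns the response body.
func call(contentType string) string {
	req := httptest.NewRequest(http.MethodPost, "/", strings.NewReader("{}"))
	req.Header.Set("Content-Type", contentType)
	rec := httptest.NewRecorder()
	serve(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Printf("%q\n", call("text/plain"))
	fmt.Println(call("application/json"))
}
</code></pre></div></div>

<p>In this sketch a request with a non-JSON content type gets an empty body, while a JSON request gets a serialized <code class="language-plaintext highlighter-rouge">AdmissionReview</code>.</p>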

<p>The response to every request is built by the Admission Server's <code class="language-plaintext highlighter-rouge">admit</code> method; <strong>the response body carries the requested resource's</strong> <a href="https://www.rfc-editor.org/rfc/rfc6902.html">JSON Patches</a>:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/admission-controller/logic/server.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">AdmissionServer</span><span class="p">)</span> <span class="n">admit</span><span class="p">(</span><span class="n">data</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">AdmissionResponse</span><span class="p">,</span> <span class="n">metrics_admission</span><span class="o">.</span><span class="n">AdmissionStatus</span><span class="p">,</span> <span class="n">metrics_admission</span><span class="o">.</span><span class="n">AdmissionResource</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">response</span> <span class="o">:=</span> <span class="n">v1</span><span class="o">.</span><span class="n">AdmissionResponse</span><span class="p">{}</span>
	<span class="n">ar</span> <span class="o">:=</span> <span class="n">v1</span><span class="o">.</span><span class="n">AdmissionReview</span><span class="p">{}</span>
	<span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">Unmarshal</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ar</span><span class="p">)</span>  <span class="c">// parse the request data</span>

	<span class="k">var</span> <span class="n">patches</span> <span class="p">[]</span><span class="n">resource</span><span class="o">.</span><span class="n">PatchRecord</span>

	<span class="n">admittedGroupResource</span> <span class="o">:=</span> <span class="n">metav1</span><span class="o">.</span><span class="n">GroupResource</span><span class="p">{</span>
		<span class="n">Group</span><span class="o">:</span>    <span class="n">ar</span><span class="o">.</span><span class="n">Request</span><span class="o">.</span><span class="n">Resource</span><span class="o">.</span><span class="n">Group</span><span class="p">,</span>
		<span class="n">Resource</span><span class="o">:</span> <span class="n">ar</span><span class="o">.</span><span class="n">Request</span><span class="o">.</span><span class="n">Resource</span><span class="o">.</span><span class="n">Resource</span><span class="p">,</span>
	<span class="p">}</span>
	<span class="n">handler</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">s</span><span class="o">.</span><span class="n">resourceHandlers</span><span class="p">[</span><span class="n">admittedGroupResource</span><span class="p">]</span>  <span class="c">// look up the Handler for the requested resource</span>
	<span class="k">if</span> <span class="n">ok</span> <span class="p">{</span>
		<span class="n">patches</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">handler</span><span class="o">.</span><span class="n">GetPatches</span><span class="p">(</span><span class="n">ar</span><span class="o">.</span><span class="n">Request</span><span class="p">)</span>  <span class="c">// get the JSON patches for that resource</span>
		<span class="n">resource</span> <span class="o">=</span> <span class="n">handler</span><span class="o">.</span><span class="n">AdmissionResource</span><span class="p">()</span>
	<span class="p">}</span>

	<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">patches</span><span class="p">)</span> <span class="o">&gt;</span> <span class="m">0</span> <span class="p">{</span>
		<span class="n">patch</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">Marshal</span><span class="p">(</span><span class="n">patches</span><span class="p">)</span>  <span class="c">// encode the response data</span>
		<span class="n">response</span><span class="o">.</span><span class="n">PatchType</span> <span class="o">=</span> <span class="s">"JSONPatch"</span>
		<span class="n">response</span><span class="o">.</span><span class="n">Patch</span> <span class="o">=</span> <span class="n">patch</span>
	<span class="p">}</span>

	<span class="c">// ... compute status</span>

	<span class="k">return</span> <span class="o">&amp;</span><span class="n">response</span><span class="p">,</span> <span class="n">status</span><span class="p">,</span> <span class="n">resource</span>
<span class="p">}</span>
</code></pre></div></div>
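
<p>For reference, this is roughly what the encoded patch looks like on the wire. The sketch assumes a hypothetical <code class="language-plaintext highlighter-rouge">patchRecord</code> type with the three RFC 6902 fields <code class="language-plaintext highlighter-rouge">op</code>, <code class="language-plaintext highlighter-rouge">path</code> and <code class="language-plaintext highlighter-rouge">value</code>; the path and the value <code class="language-plaintext highlighter-rouge">250m</code> are made up for illustration and are not taken from the VPA source.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"encoding/json"
	"fmt"
)

// patchRecord is a hypothetical stand-in for resource.PatchRecord:
// a single RFC 6902 operation.
type patchRecord struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// encodePatches serializes the operations into one JSON Patch document.
func encodePatches(patches []patchRecord) (string, error) {
	raw, err := json.Marshal(patches)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	out, err := encodePatches([]patchRecord{
		{Op: "add", Path: "/spec/containers/0/resources/requests/cpu", Value: "250m"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
	// [{"op":"add","path":"/spec/containers/0/resources/requests/cpu","value":"250m"}]
}
</code></pre></div></div>

<p>The API server applies such a document to the incoming object before persisting it, which is how a recommendation can end up in a Pod's resource requests.</p>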

<p>Taking the Pod resource as an example, the <code class="language-plaintext highlighter-rouge">GetPatches</code> method of its Handler looks like this:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/admission-controller/resource/pod/handler.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">h</span> <span class="o">*</span><span class="n">resourceHandler</span><span class="p">)</span> <span class="n">GetPatches</span><span class="p">(</span><span class="n">ar</span> <span class="o">*</span><span class="n">admissionv1</span><span class="o">.</span><span class="n">AdmissionRequest</span><span class="p">)</span> <span class="p">([]</span><span class="n">resource_admission</span><span class="o">.</span><span class="n">PatchRecord</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">raw</span><span class="p">,</span> <span class="n">namespace</span> <span class="o">:=</span> <span class="n">ar</span><span class="o">.</span><span class="n">Object</span><span class="o">.</span><span class="n">Raw</span><span class="p">,</span> <span class="n">ar</span><span class="o">.</span><span class="n">Namespace</span>
	<span class="n">pod</span> <span class="o">:=</span> <span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">{}</span>
	<span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">Unmarshal</span><span class="p">(</span><span class="n">raw</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">pod</span><span class="p">)</span>
	<span class="c">// ...</span>

	<span class="n">controllingVpa</span> <span class="o">:=</span> <span class="n">h</span><span class="o">.</span><span class="n">vpaMatcher</span><span class="o">.</span><span class="n">GetMatchingVPA</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pod</span><span class="p">)</span>  <span class="c">// find the VPA resource that controls this Pod</span>
	<span class="n">patches</span> <span class="o">:=</span> <span class="p">[]</span><span class="n">resource_admission</span><span class="o">.</span><span class="n">PatchRecord</span><span class="p">{}</span>

	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">c</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">h</span><span class="o">.</span><span class="n">patchCalculators</span> <span class="p">{</span>
		<span class="n">partialPatches</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">CalculatePatches</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pod</span><span class="p">,</span> <span class="n">controllingVpa</span><span class="p">)</span>  <span class="c">// each calculator returns the patches it computes in its own way</span>
		<span class="n">patches</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">patches</span><span class="p">,</span> <span class="n">partialPatches</span><span class="o">...</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">patches</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
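<p>The way partial patches are concatenated and then encoded as one JSON Patch document can be sketched as follows. This is a minimal, standalone illustration: <code class="language-plaintext highlighter-rouge">patchRecord</code>, <code class="language-plaintext highlighter-rouge">calculator</code>, <code class="language-plaintext highlighter-rouge">collectPatches</code> and <code class="language-plaintext highlighter-rouge">encodePatches</code> are hypothetical stand-ins, not the VPA types, and the patch paths are made up for the example.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchRecord is a hypothetical stand-in for one JSON Patch (RFC 6902) operation.
type patchRecord struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// calculator is a hypothetical stand-in for a patch calculator:
// it looks at a Pod (here only its name) and returns partial patches.
type calculator func(podName string) []patchRecord

// collectPatches concatenates the partial patches of every calculator, in order.
func collectPatches(podName string, calcs []calculator) []patchRecord {
	patches := []patchRecord{}
	for _, c := range calcs {
		patches = append(patches, c(podName)...)
	}
	return patches
}

// encodePatches serializes the collected patches into a single JSON array.
func encodePatches(patches []patchRecord) (string, error) {
	body, err := json.Marshal(patches)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	calcs := []calculator{
		func(string) []patchRecord {
			return []patchRecord{{Op: "add", Path: "/spec/containers/0/resources/requests/cpu", Value: "250m"}}
		},
		func(name string) []patchRecord {
			return []patchRecord{{Op: "add", Path: "/metadata/annotations/example", Value: name}}
		},
	}
	body, err := encodePatches(collectPatches("demo", calcs))
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

<p>Keeping each calculator independent means a new kind of patch only requires appending another calculator to the list; the handler itself does not change.</p>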

<h3 id="updater">Updater</h3>

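<p>The Autoscaler's Updater is deployed as a Deployment, by default in the <code class="language-plaintext highlighter-rouge">kube-system</code> namespace. The Updater decides which Pods need their resources adjusted to the values computed by the Recommender, and it applies the adjustment by <strong>evicting the Pod so that it gets recreated</strong> (while also taking the <a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/">Pod Disruption Budget</a> into account). <strong>The Updater cannot update resources by itself: it only evicts Pods, and applying the new resource values when the Pod is created again is left to the Admission Controller</strong>.</p>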

<p>The key construction of the Updater is shown below. It runs in an endless loop, and one resource update pass is executed every 1 min by default.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/updater/main.go</span>

<span class="n">updater</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">updater</span><span class="o">.</span><span class="n">NewUpdater</span><span class="p">(</span>
    <span class="n">kubeClient</span><span class="p">,</span>
    <span class="n">vpaClient</span><span class="p">,</span>
    <span class="o">*</span><span class="n">minReplicas</span><span class="p">,</span>                        <span class="c">// default=2</span>
    <span class="o">*</span><span class="n">evictionToleranceFraction</span><span class="p">,</span>          <span class="c">// default=0.5: when there is more than one Pod, the fraction of Pods that may be evicted</span>
    <span class="o">*</span><span class="n">useAdmissionControllerStatus</span><span class="p">,</span>       <span class="c">// run the updater only while the admission controller status is healthy</span>
    <span class="n">admissionControllerStatusNamespace</span><span class="p">,</span>  <span class="c">// namespace of the admission controller, kube-system by default</span>
    <span class="c">/* evictionAdmission: */</span> <span class="no">nil</span><span class="p">,</span>
    <span class="n">vpa_api_util</span><span class="o">.</span><span class="n">NewCappingRecommendationProcessor</span><span class="p">(),</span>  <span class="c">// adjusts the resource values of a Pod so they respect the VPA Resource Policy and container limits</span>
    <span class="n">targetSelectorFetcher</span><span class="p">,</span>
    <span class="n">priority</span><span class="o">.</span><span class="n">NewProcessor</span><span class="p">(),</span>  <span class="c">// handles the eviction priority logic</span>
    <span class="o">*</span><span class="n">vpaObjectNamespace</span><span class="p">,</span>      <span class="c">// namespace in which to look up VPA objects, all namespaces by default</span>
    <span class="c">// ...</span>
<span class="p">)</span>
</code></pre></div></div>
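<p>The periodic driver around the Updater can be pictured roughly as below. This is only a sketch under assumptions: <code class="language-plaintext highlighter-rouge">runLoop</code> and <code class="language-plaintext highlighter-rouge">demo</code> are hypothetical names, the per-pass timeout equal to the interval is a choice made for the example, and the real <code class="language-plaintext highlighter-rouge">main.go</code> may be organized differently. A 1 ms interval replaces the 1 min default so the example finishes quickly.</p>

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runLoop calls runOnce once per tick until ctx is cancelled and
// returns how many passes were executed.
func runLoop(ctx context.Context, interval time.Duration, runOnce func(context.Context)) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	passes := 0
	for ctx.Err() == nil {
		select {
		case <-ctx.Done():
			// Cancelled while waiting; the loop condition ends the loop.
		case <-ticker.C:
			// Bound each pass so a slow pass cannot run forever (assumption).
			passCtx, cancel := context.WithTimeout(ctx, interval)
			runOnce(passCtx)
			cancel()
			passes++
		}
	}
	return passes
}

// demo runs the loop with a fake RunOnce that stops the loop after limit passes.
// limit must be at least 1.
func demo(limit int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	calls := 0
	return runLoop(ctx, time.Millisecond, func(context.Context) {
		calls++
		if calls == limit {
			cancel()
		}
	})
}

func main() {
	fmt.Println("passes:", demo(3))
}
```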

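<p>Every update pass calls the Updater's <code class="language-plaintext highlighter-rouge">RunOnce</code> method, which <strong>enumerates every VPA resource together with its Pods, selects the Pods under each VPA that need a resource update, and evicts them one by one</strong>.</p>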

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/updater/logic/updater.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">u</span> <span class="o">*</span><span class="n">updater</span><span class="p">)</span> <span class="n">RunOnce</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">u</span><span class="o">.</span><span class="n">useAdmissionControllerStatus</span> <span class="p">{</span>
		<span class="n">isValid</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">u</span><span class="o">.</span><span class="n">statusValidator</span><span class="o">.</span><span class="n">IsStatusValid</span><span class="p">(</span><span class="n">status</span><span class="o">.</span><span class="n">AdmissionControllerStatusTimeout</span><span class="p">)</span>  <span class="c">// check that the Admission Controller status is healthy</span>
		<span class="k">if</span> <span class="o">!</span><span class="n">isValid</span> <span class="p">{</span>
			<span class="k">return</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="n">vpaList</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">u</span><span class="o">.</span><span class="n">vpaLister</span><span class="o">.</span><span class="n">List</span><span class="p">(</span><span class="n">labels</span><span class="o">.</span><span class="n">Everything</span><span class="p">())</span>  <span class="c">// list all VPA resources</span>
	<span class="n">vpas</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="o">*</span><span class="n">vpa_api_util</span><span class="o">.</span><span class="n">VpaWithSelector</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">vpa</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">vpaList</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">vpa_api_util</span><span class="o">.</span><span class="n">GetUpdateMode</span><span class="p">(</span><span class="n">vpa</span><span class="p">)</span> <span class="o">!=</span> <span class="n">vpa_types</span><span class="o">.</span><span class="n">UpdateModeRecreate</span> <span class="o">&amp;&amp;</span>
			<span class="n">vpa_api_util</span><span class="o">.</span><span class="n">GetUpdateMode</span><span class="p">(</span><span class="n">vpa</span><span class="p">)</span> <span class="o">!=</span> <span class="n">vpa_types</span><span class="o">.</span><span class="n">UpdateModeAuto</span> <span class="p">{</span>  <span class="c">// the Updater only acts on VPAs in "Recreate" or "Auto" mode</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="n">selector</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">u</span><span class="o">.</span><span class="n">selectorFetcher</span><span class="o">.</span><span class="n">Fetch</span><span class="p">(</span><span class="n">vpa</span><span class="p">)</span>
		<span class="n">vpas</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">vpas</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">vpa_api_util</span><span class="o">.</span><span class="n">VpaWithSelector</span><span class="p">{</span>
			<span class="n">Vpa</span><span class="o">:</span>      <span class="n">vpa</span><span class="p">,</span>
			<span class="n">Selector</span><span class="o">:</span> <span class="n">selector</span><span class="p">,</span>
		<span class="p">})</span>
	<span class="p">}</span>

	<span class="n">podsList</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">u</span><span class="o">.</span><span class="n">podLister</span><span class="o">.</span><span class="n">List</span><span class="p">(</span><span class="n">labels</span><span class="o">.</span><span class="n">Everything</span><span class="p">())</span>  <span class="c">// list all Pod resources</span>
	<span class="n">allLivePods</span> <span class="o">:=</span> <span class="n">filterDeletedPods</span><span class="p">(</span><span class="n">podsList</span><span class="p">)</span>  <span class="c">// drop Pods that are being deleted (those with a non-nil DeletionTimestamp)</span>
	<span class="n">controlledPods</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">(</span><span class="k">map</span><span class="p">[</span><span class="o">*</span><span class="n">vpa_types</span><span class="o">.</span><span class="n">VerticalPodAutoscaler</span><span class="p">][]</span><span class="o">*</span><span class="n">apiv1</span><span class="o">.</span><span class="n">Pod</span><span class="p">)</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">pod</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">allLivePods</span> <span class="p">{</span>
		<span class="n">controllingVPA</span> <span class="o">:=</span> <span class="n">vpa_api_util</span><span class="o">.</span><span class="n">GetControllingVPAForPod</span><span class="p">(</span><span class="n">pod</span><span class="p">,</span> <span class="n">vpas</span><span class="p">)</span>  <span class="c">// find the VPA resource that controls this Pod</span>
		<span class="k">if</span> <span class="n">controllingVPA</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">controlledPods</span><span class="p">[</span><span class="n">controllingVPA</span><span class="o">.</span><span class="n">Vpa</span><span class="p">]</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">controlledPods</span><span class="p">[</span><span class="n">controllingVPA</span><span class="o">.</span><span class="n">Vpa</span><span class="p">],</span> <span class="n">pod</span><span class="p">)</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="k">for</span> <span class="n">vpa</span><span class="p">,</span> <span class="n">livePods</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">controlledPods</span> <span class="p">{</span>
		<span class="n">evictionLimiter</span> <span class="o">:=</span> <span class="n">u</span><span class="o">.</span><span class="n">evictionFactory</span><span class="o">.</span><span class="n">NewPodsEvictionRestriction</span><span class="p">(</span><span class="n">livePods</span><span class="p">,</span> <span class="n">vpa</span><span class="p">)</span>
		<span class="n">podsForUpdate</span> <span class="o">:=</span> <span class="n">u</span><span class="o">.</span><span class="n">getPodsUpdateOrder</span><span class="p">(</span><span class="n">filterNonEvictablePods</span><span class="p">(</span><span class="n">livePods</span><span class="p">,</span> <span class="n">evictionLimiter</span><span class="p">),</span> <span class="n">vpa</span><span class="p">)</span>  <span class="c">// Pods that need a resource update and are therefore candidates for eviction</span>
		<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">pod</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">podsForUpdate</span> <span class="p">{</span>
			<span class="k">if</span> <span class="o">!</span><span class="n">evictionLimiter</span><span class="o">.</span><span class="n">CanEvict</span><span class="p">(</span><span class="n">pod</span><span class="p">)</span> <span class="p">{</span>  <span class="c">// check whether this Pod may be evicted</span>
				<span class="k">continue</span>
			<span class="p">}</span>
			<span class="n">evictErr</span> <span class="o">:=</span> <span class="n">evictionLimiter</span><span class="o">.</span><span class="n">Evict</span><span class="p">(</span><span class="n">pod</span><span class="p">,</span> <span class="n">u</span><span class="o">.</span><span class="n">eventRecorder</span><span class="p">)</span>  <span class="c">// perform the eviction</span>
		<span class="p">}</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="优先级处理">Priority Handling</h4>

<p>The Updater's <code class="language-plaintext highlighter-rouge">getPodsUpdateOrder</code> method returns the list of Pods that need a resource update. The Pods in that list are <strong>sorted by update priority, from highest to lowest</strong>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/updater/logic/updater.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">u</span> <span class="o">*</span><span class="n">updater</span><span class="p">)</span> <span class="n">getPodsUpdateOrder</span><span class="p">(</span><span class="n">pods</span> <span class="p">[]</span><span class="o">*</span><span class="n">apiv1</span><span class="o">.</span><span class="n">Pod</span><span class="p">,</span> <span class="n">vpa</span> <span class="o">*</span><span class="n">vpa_types</span><span class="o">.</span><span class="n">VerticalPodAutoscaler</span><span class="p">)</span> <span class="p">[]</span><span class="o">*</span><span class="n">apiv1</span><span class="o">.</span><span class="n">Pod</span> <span class="p">{</span>
	<span class="n">priorityCalculator</span> <span class="o">:=</span> <span class="n">priority</span><span class="o">.</span><span class="n">NewUpdatePriorityCalculator</span><span class="p">(</span><span class="n">vpa</span><span class="p">,</span> <span class="no">nil</span><span class="p">,</span> <span class="n">u</span><span class="o">.</span><span class="n">recommendationProcessor</span><span class="p">,</span> <span class="n">u</span><span class="o">.</span><span class="n">priorityProcessor</span><span class="p">)</span>

	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">pod</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">pods</span> <span class="p">{</span>
		<span class="n">priorityCalculator</span><span class="o">.</span><span class="n">AddPod</span><span class="p">(</span><span class="n">pod</span><span class="p">,</span> <span class="n">time</span><span class="o">.</span><span class="n">Now</span><span class="p">())</span>  <span class="c">// add the Pod and compute its priority</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">priorityCalculator</span><span class="o">.</span><span class="n">GetSortedPods</span><span class="p">(</span><span class="n">u</span><span class="o">.</span><span class="n">evictionAdmission</span><span class="p">)</span>  <span class="c">// sort the Pods by priority (ResourceDiff) in descending order</span>
<span class="p">}</span>
</code></pre></div></div>
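<p>The descending sort can be illustrated with a simplified sketch. It assumes that <code class="language-plaintext highlighter-rouge">ResourceDiff</code> is the only sort key, as the comment above describes; the real comparison may take further fields into account. <code class="language-plaintext highlighter-rouge">prioritizedPod</code> here is a reduced, hypothetical version of the type, and <code class="language-plaintext highlighter-rouge">sortedByDiffDesc</code> is a made-up helper name.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// prioritizedPod is a reduced, hypothetical version of the updater's type:
// only the Pod name and its ResourceDiff are kept.
type prioritizedPod struct {
	name         string
	resourceDiff float64
}

// sortedByDiffDesc returns a copy of pods ordered from the largest
// resourceDiff to the smallest; the input slice is left untouched.
func sortedByDiffDesc(pods []prioritizedPod) []prioritizedPod {
	out := make([]prioritizedPod, len(pods))
	copy(out, pods)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].resourceDiff > out[j].resourceDiff
	})
	return out
}

func main() {
	pods := []prioritizedPod{
		{name: "web-0", resourceDiff: 0.2},
		{name: "web-1", resourceDiff: 1.5},
		{name: "web-2", resourceDiff: 0.7},
	}
	for _, p := range sortedByDiffDesc(pods) {
		fmt.Println(p.name, p.resourceDiff)
	}
}
```

<p>With this ordering, the Pod whose requests deviate most from the recommendation is considered for eviction first.</p>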

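<p>The <code class="language-plaintext highlighter-rouge">AddPod</code> method collects the Pods that are eligible for a resource update. Besides checking whether the current resource values fall outside the recommended range (<code class="language-plaintext highlighter-rouge">OutsideRecommendedRange</code>) and whether the update would leave the values unchanged (<code class="language-plaintext highlighter-rouge">ResourceDiff == 0</code>), it also checks whether a container in the Pod was OOM-killed shortly after it started (quick OOM). An OOM that soon after start suggests that the container's resources are set too low and need to be scaled up urgently.</p>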

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/updater/priority/update_priority_calculator.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">calc</span> <span class="o">*</span><span class="n">UpdatePriorityCalculator</span><span class="p">)</span> <span class="n">AddPod</span><span class="p">(</span><span class="n">pod</span> <span class="o">*</span><span class="n">apiv1</span><span class="o">.</span><span class="n">Pod</span><span class="p">,</span> <span class="n">now</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">processedRecommendation</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">calc</span><span class="o">.</span><span class="n">recommendationProcessor</span><span class="o">.</span><span class="n">Apply</span><span class="p">(</span><span class="n">calc</span><span class="o">.</span><span class="n">vpa</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">Recommendation</span><span class="p">,</span> <span class="n">calc</span><span class="o">.</span><span class="n">vpa</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">ResourcePolicy</span><span class="p">,</span> <span class="n">calc</span><span class="o">.</span><span class="n">vpa</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">Conditions</span><span class="p">,</span> <span class="n">pod</span><span class="p">)</span>  <span class="c">// get the recommended resource values</span>

	<span class="n">hasObservedContainers</span><span class="p">,</span> <span class="n">vpaContainerSet</span> <span class="o">:=</span> <span class="n">parseVpaObservedContainers</span><span class="p">(</span><span class="n">pod</span><span class="p">)</span>  <span class="c">// parse the vpaObservedContainers annotation of the Pod to get the set of containers observed by the Admission Controller</span>

	<span class="n">updatePriority</span> <span class="o">:=</span> <span class="n">calc</span><span class="o">.</span><span class="n">priorityProcessor</span><span class="o">.</span><span class="n">GetUpdatePriority</span><span class="p">(</span><span class="n">pod</span><span class="p">,</span> <span class="n">calc</span><span class="o">.</span><span class="n">vpa</span><span class="p">,</span> <span class="n">processedRecommendation</span><span class="p">)</span>  <span class="c">// compute the update priority</span>

	<span class="c">// quick OOM detection starts here</span>
	<span class="n">quickOOM</span> <span class="o">:=</span> <span class="no">false</span>
	<span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">pod</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">ContainerStatuses</span> <span class="p">{</span>
		<span class="n">cs</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">pod</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">ContainerStatuses</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
		<span class="k">if</span> <span class="n">hasObservedContainers</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">vpaContainerSet</span><span class="o">.</span><span class="n">Has</span><span class="p">(</span><span class="n">cs</span><span class="o">.</span><span class="n">Name</span><span class="p">)</span> <span class="p">{</span>  <span class="c">// containers not observed by the Admission Controller are excluded from the quick OOM check</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="n">crp</span> <span class="o">:=</span> <span class="n">vpa_api_util</span><span class="o">.</span><span class="n">GetContainerResourcePolicy</span><span class="p">(</span><span class="n">cs</span><span class="o">.</span><span class="n">Name</span><span class="p">,</span> <span class="n">calc</span><span class="o">.</span><span class="n">vpa</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">ResourcePolicy</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">crp</span> <span class="o">!=</span> <span class="no">nil</span> <span class="o">&amp;&amp;</span> <span class="n">crp</span><span class="o">.</span><span class="n">Mode</span> <span class="o">!=</span> <span class="no">nil</span> <span class="o">&amp;&amp;</span> <span class="o">*</span><span class="n">crp</span><span class="o">.</span><span class="n">Mode</span> <span class="o">==</span> <span class="n">vpa_types</span><span class="o">.</span><span class="n">ContainerScalingModeOff</span> <span class="p">{</span>  <span class="c">// containers whose ResourcePolicy mode is ContainerScalingModeOff are excluded from the quick OOM check as well</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="n">terminationState</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">cs</span><span class="o">.</span><span class="n">LastTerminationState</span>
		<span class="k">if</span> <span class="n">terminationState</span><span class="o">.</span><span class="n">Terminated</span> <span class="o">!=</span> <span class="no">nil</span> <span class="o">&amp;&amp;</span> <span class="n">terminationState</span><span class="o">.</span><span class="n">Terminated</span><span class="o">.</span><span class="n">Reason</span> <span class="o">==</span> <span class="s">"OOMKilled"</span> <span class="o">&amp;&amp;</span>
			<span class="n">terminationState</span><span class="o">.</span><span class="n">Terminated</span><span class="o">.</span><span class="n">FinishedAt</span><span class="o">.</span><span class="n">Time</span><span class="o">.</span><span class="n">Sub</span><span class="p">(</span><span class="n">terminationState</span><span class="o">.</span><span class="n">Terminated</span><span class="o">.</span><span class="n">StartedAt</span><span class="o">.</span><span class="n">Time</span><span class="p">)</span> <span class="o">&lt;</span> <span class="o">*</span><span class="n">evictAfterOOMThreshold</span> <span class="c">/* default 10 min */</span> <span class="p">{</span>
			<span class="n">quickOOM</span> <span class="o">=</span> <span class="no">true</span>  <span class="c">// the last termination was caused by OOM and the container ran for less than the threshold, so it counts as a quick OOM</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="k">if</span> <span class="o">!</span><span class="n">updatePriority</span><span class="o">.</span><span class="n">OutsideRecommendedRange</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">quickOOM</span> <span class="p">{</span>
		<span class="c">// handle a few exceptional cases on the normal path; if any of them applies, return immediately</span>
		<span class="c">// ...</span>
	<span class="p">}</span>

	<span class="c">// a Pod that went through a quick OOM but whose resource values would not change is skipped</span>
	<span class="k">if</span> <span class="n">quickOOM</span> <span class="o">&amp;&amp;</span> <span class="n">updatePriority</span><span class="o">.</span><span class="n">ResourceDiff</span> <span class="o">==</span> <span class="m">0</span> <span class="p">{</span>
		<span class="k">return</span>
	<span class="p">}</span>

	<span class="n">calc</span><span class="o">.</span><span class="n">pods</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">calc</span><span class="o">.</span><span class="n">pods</span><span class="p">,</span> <span class="n">prioritizedPod</span><span class="p">{</span>
		<span class="n">pod</span><span class="o">:</span>            <span class="n">pod</span><span class="p">,</span>
		<span class="n">priority</span><span class="o">:</span>       <span class="n">updatePriority</span><span class="p">,</span>
		<span class="n">recommendation</span><span class="o">:</span> <span class="n">processedRecommendation</span><span class="p">})</span>
<span class="p">}</span>
</code></pre></div></div>
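<p>The quick OOM condition from the loop above can be isolated into a small predicate. The sketch below is standalone: <code class="language-plaintext highlighter-rouge">termination</code>, <code class="language-plaintext highlighter-rouge">isQuickOOM</code>, <code class="language-plaintext highlighter-rouge">oomAfterMinutes</code> and <code class="language-plaintext highlighter-rouge">defaultThreshold</code> are hypothetical names that stand in for the Kubernetes container status types and the <code class="language-plaintext highlighter-rouge">evictAfterOOMThreshold</code> flag.</p>

```go
package main

import (
	"fmt"
	"time"
)

// defaultThreshold mirrors the 10 min default mentioned in the code above.
const defaultThreshold = 10 * time.Minute

// termination is a hypothetical, reduced view of a container's last
// terminated state: why it stopped and how long it ran.
type termination struct {
	reason     string
	startedAt  time.Time
	finishedAt time.Time
}

// isQuickOOM reports whether the last termination was an OOM kill that
// happened less than threshold after the container started.
func isQuickOOM(t *termination, threshold time.Duration) bool {
	if t == nil || t.reason != "OOMKilled" {
		return false
	}
	return t.finishedAt.Sub(t.startedAt) < threshold
}

// oomAfterMinutes builds an OOM-killed termination whose container ran
// for the given number of minutes (helper for the example only).
func oomAfterMinutes(minutes int) *termination {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	return &termination{
		reason:     "OOMKilled",
		startedAt:  start,
		finishedAt: start.Add(time.Duration(minutes) * time.Minute),
	}
}

func main() {
	fmt.Println(isQuickOOM(oomAfterMinutes(3), defaultThreshold))  // ran 3 min before the OOM kill
	fmt.Println(isQuickOOM(oomAfterMinutes(30), defaultThreshold)) // ran 30 min before the OOM kill
}
```

<p>Only the run time of the terminated container is compared with the threshold; how long ago the OOM happened plays no role in this condition.</p>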

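<p>The update priority is computed by the <code class="language-plaintext highlighter-rouge">GetUpdatePriority</code> method. In its return type <code class="language-plaintext highlighter-rouge">PodPriority</code>, the <code class="language-plaintext highlighter-rouge">ResourceDiff</code> field holds <strong>the sum, over all resource types, of the normalized difference (the absolute difference between the requested and the recommended value, divided by the requested value)</strong>. When Pods are later sorted by update priority, <code class="language-plaintext highlighter-rouge">ResourceDiff</code> is <strong>the key the sort is based on</strong>.</p>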
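<p>As a worked example of this normalization, here is a minimal standalone sketch. The function name <code class="language-plaintext highlighter-rouge">resourceDiff</code>, the plain <code class="language-plaintext highlighter-rouge">map[string]int64</code> inputs and the numbers are assumptions made for illustration; only the arithmetic follows the description: a request of 100 with a recommendation of 150 contributes 0.5, and a request of 200000 with a recommendation of 100000 contributes another 0.5.</p>

```go
package main

import (
	"fmt"
	"math"
)

// resourceDiff sums, over every recommended resource, the absolute
// difference between request and recommendation divided by the request.
// A missing or zero request is treated as 1 to avoid dividing by zero.
func resourceDiff(requests, recommended map[string]int64) float64 {
	diff := 0.0
	for name, rec := range recommended {
		req := math.Max(float64(requests[name]), 1.0)
		diff += math.Abs(req-float64(rec)) / req
	}
	return diff
}

func main() {
	requests := map[string]int64{"cpu": 100, "memory": 200000}
	recommended := map[string]int64{"cpu": 150, "memory": 100000}

	// |100-150|/100 + |200000-100000|/200000 = 0.5 + 0.5
	fmt.Println(resourceDiff(requests, recommended)) // prints 1

	// With no request at all the denominator falls back to 1,
	// so the difference becomes very large: |1-150|/1 = 149.
	fmt.Println(resourceDiff(map[string]int64{}, map[string]int64{"cpu": 150})) // prints 149
}
```

<p>Because each term is divided by the request, a small container that is far off its recommendation can outrank a large container that is only slightly off.</p>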

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/updater/priority/priority_processor.go</span>

<span class="k">func</span> <span class="p">(</span><span class="o">*</span><span class="n">defaultPriorityProcessor</span><span class="p">)</span> <span class="n">GetUpdatePriority</span><span class="p">(</span><span class="n">pod</span> <span class="o">*</span><span class="n">apiv1</span><span class="o">.</span><span class="n">Pod</span><span class="p">,</span> <span class="n">_</span> <span class="o">*</span><span class="n">vpa_types</span><span class="o">.</span><span class="n">VerticalPodAutoscaler</span><span class="p">,</span>
	<span class="n">recommendation</span> <span class="o">*</span><span class="n">vpa_types</span><span class="o">.</span><span class="n">RecommendedPodResources</span><span class="p">)</span> <span class="n">PodPriority</span> <span class="p">{</span>
	<span class="n">outsideRecommendedRange</span> <span class="o">:=</span> <span class="no">false</span>
	<span class="n">scaleUp</span> <span class="o">:=</span> <span class="no">false</span>

	<span class="n">totalRequestPerResource</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">(</span><span class="k">map</span><span class="p">[</span><span class="n">apiv1</span><span class="o">.</span><span class="n">ResourceName</span><span class="p">]</span><span class="kt">int64</span><span class="p">)</span>      <span class="c">// total requested amount, per resource type</span>
	<span class="n">totalRecommendedPerResource</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">(</span><span class="k">map</span><span class="p">[</span><span class="n">apiv1</span><span class="o">.</span><span class="n">ResourceName</span><span class="p">]</span><span class="kt">int64</span><span class="p">)</span>  <span class="c">// total recommended amount, per resource type</span>

	<span class="n">hasObservedContainers</span><span class="p">,</span> <span class="n">vpaContainerSet</span> <span class="o">:=</span> <span class="n">parseVpaObservedContainers</span><span class="p">(</span><span class="n">pod</span><span class="p">)</span>  <span class="c">// same function as above</span>

	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">podContainer</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">pod</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">Containers</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">hasObservedContainers</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">vpaContainerSet</span><span class="o">.</span><span class="n">Has</span><span class="p">(</span><span class="n">podContainer</span><span class="o">.</span><span class="n">Name</span><span class="p">)</span> <span class="p">{</span>  <span class="c">// only containers observed by the Admission Controller are considered</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="n">recommendedRequest</span> <span class="o">:=</span> <span class="n">vpa_api_util</span><span class="o">.</span><span class="n">GetRecommendationForContainer</span><span class="p">(</span><span class="n">podContainer</span><span class="o">.</span><span class="n">Name</span><span class="p">,</span> <span class="n">recommendation</span><span class="p">)</span>  <span class="c">// get the recommendation for this container</span>
		<span class="k">if</span> <span class="n">recommendedRequest</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="k">for</span> <span class="n">resourceName</span><span class="p">,</span> <span class="n">recommended</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">recommendedRequest</span><span class="o">.</span><span class="n">Target</span> <span class="p">{</span>
			<span class="n">totalRecommendedPerResource</span><span class="p">[</span><span class="n">resourceName</span><span class="p">]</span> <span class="o">+=</span> <span class="n">recommended</span><span class="o">.</span><span class="n">MilliValue</span><span class="p">()</span>
			<span class="n">lowerBound</span><span class="p">,</span> <span class="n">hasLowerBound</span> <span class="o">:=</span> <span class="n">recommendedRequest</span><span class="o">.</span><span class="n">LowerBound</span><span class="p">[</span><span class="n">resourceName</span><span class="p">]</span>
			<span class="n">upperBound</span><span class="p">,</span> <span class="n">hasUpperBound</span> <span class="o">:=</span> <span class="n">recommendedRequest</span><span class="o">.</span><span class="n">UpperBound</span><span class="p">[</span><span class="n">resourceName</span><span class="p">]</span>
			<span class="k">if</span> <span class="n">request</span><span class="p">,</span> <span class="n">hasRequest</span> <span class="o">:=</span> <span class="n">podContainer</span><span class="o">.</span><span class="n">Resources</span><span class="o">.</span><span class="n">Requests</span><span class="p">[</span><span class="n">resourceName</span><span class="p">];</span> <span class="n">hasRequest</span> <span class="p">{</span>  <span class="c">// check the following boundary cases:</span>
				<span class="n">totalRequestPerResource</span><span class="p">[</span><span class="n">resourceName</span><span class="p">]</span> <span class="o">+=</span> <span class="n">request</span><span class="o">.</span><span class="n">MilliValue</span><span class="p">()</span>
				<span class="k">if</span> <span class="n">recommended</span><span class="o">.</span><span class="n">MilliValue</span><span class="p">()</span> <span class="o">&gt;</span> <span class="n">request</span><span class="o">.</span><span class="n">MilliValue</span><span class="p">()</span> <span class="p">{</span>  <span class="c">// 1. whether this is a scale-up</span>
					<span class="n">scaleUp</span> <span class="o">=</span> <span class="no">true</span>
				<span class="p">}</span>
				<span class="k">if</span> <span class="p">(</span><span class="n">hasLowerBound</span> <span class="o">&amp;&amp;</span> <span class="n">request</span><span class="o">.</span><span class="n">Cmp</span><span class="p">(</span><span class="n">lowerBound</span><span class="p">)</span> <span class="o">&lt;</span> <span class="m">0</span><span class="p">)</span> <span class="o">||</span>
					<span class="p">(</span><span class="n">hasUpperBound</span> <span class="o">&amp;&amp;</span> <span class="n">request</span><span class="o">.</span><span class="n">Cmp</span><span class="p">(</span><span class="n">upperBound</span><span class="p">)</span> <span class="o">&gt;</span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>  <span class="c">// 2. whether the request is outside the recommended range</span>
					<span class="n">outsideRecommendedRange</span> <span class="o">=</span> <span class="no">true</span>
				<span class="p">}</span>
			<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
				<span class="n">scaleUp</span> <span class="o">=</span> <span class="no">true</span>
				<span class="n">outsideRecommendedRange</span> <span class="o">=</span> <span class="no">true</span>
			<span class="p">}</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="n">resourceDiff</span> <span class="o">:=</span> <span class="m">0.0</span>  <span class="c">// sum of the differences across all resource types</span>
	<span class="k">for</span> <span class="n">resource</span><span class="p">,</span> <span class="n">totalRecommended</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">totalRecommendedPerResource</span> <span class="p">{</span>
		<span class="n">totalRequest</span> <span class="o">:=</span> <span class="n">math</span><span class="o">.</span><span class="n">Max</span><span class="p">(</span><span class="kt">float64</span><span class="p">(</span><span class="n">totalRequestPerResource</span><span class="p">[</span><span class="n">resource</span><span class="p">]),</span> <span class="m">1.0</span><span class="p">)</span>
		<span class="n">resourceDiff</span> <span class="o">+=</span> <span class="n">math</span><span class="o">.</span><span class="n">Abs</span><span class="p">(</span><span class="n">totalRequest</span><span class="o">-</span><span class="kt">float64</span><span class="p">(</span><span class="n">totalRecommended</span><span class="p">))</span> <span class="o">/</span> <span class="n">totalRequest</span>  <span class="c">// normalize each resource type's difference by its total request</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">PodPriority</span><span class="p">{</span>
		<span class="n">OutsideRecommendedRange</span><span class="o">:</span> <span class="n">outsideRecommendedRange</span><span class="p">,</span>
		<span class="n">ScaleUp</span><span class="o">:</span>                 <span class="n">scaleUp</span><span class="p">,</span>
		<span class="n">ResourceDiff</span><span class="o">:</span>            <span class="n">resourceDiff</span><span class="p">,</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
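
<p>To make the normalization concrete, here is a minimal sketch of the final loop above, written against plain maps rather than the VPA types. The function name <code class="language-plaintext highlighter-rouge">resourceDiff</code> and its map-based signature are assumptions made for illustration; only the arithmetic mirrors the snippet.</p>

```go
package main

import "math"

// resourceDiff is a hypothetical helper (not part of the VPA code base).
// It sums, over every resource type, the relative gap between the total
// request and the total recommendation, both expressed in milli-units.
func resourceDiff(totalRequest, totalRecommended map[string]int64) float64 {
	diff := 0.0
	for name, recommended := range totalRecommended {
		// Clamp the request to 1 so that a missing request cannot divide by zero.
		request := math.Max(float64(totalRequest[name]), 1.0)
		// Dividing by the request makes CPU and memory gaps comparable.
		diff += math.Abs(request-float64(recommended)) / request
	}
	return diff
}
```

<p>For example, a CPU request of 500 against a recommendation of 1000 contributes 1.0, and a memory request of 2000 against a recommendation of 1000 contributes 0.5, so the sketch returns 1.5 for that Pod.</p>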

<h4 id="evict">Evict</h4>

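<p>For every Pod whose resources need updating, the Updater first checks whether the Pod can be evicted. If it can, the Updater evicts it; otherwise the Pod is skipped in this round.</p>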

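<p>The Updater decides whether a Pod can be evicted through the <code class="language-plaintext highlighter-rouge">CanEvict</code> method. <strong>It ensures that the controller owning a Pod only has as many replicas evicted as it can tolerate, while also ensuring that this number is never 0 (at least one replica can be evicted)</strong>.</p>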

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/updater/eviction/pods_eviction_restriction.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">e</span> <span class="o">*</span><span class="n">podsEvictionRestrictionImpl</span><span class="p">)</span> <span class="n">CanEvict</span><span class="p">(</span><span class="n">pod</span> <span class="o">*</span><span class="n">apiv1</span><span class="o">.</span><span class="n">Pod</span><span class="p">)</span> <span class="kt">bool</span> <span class="p">{</span>
	<span class="n">cr</span><span class="p">,</span> <span class="n">present</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">podToReplicaCreatorMap</span><span class="p">[</span><span class="n">getPodID</span><span class="p">(</span><span class="n">pod</span><span class="p">)]</span>  <span class="c">// look up the Pod's controller by Pod ID</span>
	<span class="k">if</span> <span class="n">present</span> <span class="p">{</span>
		<span class="n">singleGroupStats</span><span class="p">,</span> <span class="n">present</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">creatorToSingleGroupStatsMap</span><span class="p">[</span><span class="n">cr</span><span class="p">]</span>
		<span class="k">if</span> <span class="n">pod</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">Phase</span> <span class="o">==</span> <span class="n">apiv1</span><span class="o">.</span><span class="n">PodPending</span> <span class="p">{</span>
			<span class="k">return</span> <span class="no">true</span>  <span class="c">// a Pod in the Pending phase can always be evicted</span>
		<span class="p">}</span>
		<span class="k">if</span> <span class="n">present</span> <span class="p">{</span>
			<span class="n">shouldBeAlive</span> <span class="o">:=</span> <span class="n">singleGroupStats</span><span class="o">.</span><span class="n">configured</span> <span class="o">-</span> <span class="n">singleGroupStats</span><span class="o">.</span><span class="n">evictionTolerance</span>  <span class="c">// replicas that must stay alive; evictionTolerance (derived from evictionToleranceFraction) caps how many may be evicted</span>
			<span class="k">if</span> <span class="n">singleGroupStats</span><span class="o">.</span><span class="n">running</span><span class="o">-</span><span class="n">singleGroupStats</span><span class="o">.</span><span class="n">evicted</span> <span class="o">&gt;</span> <span class="n">shouldBeAlive</span> <span class="p">{</span>
				<span class="k">return</span> <span class="no">true</span>  <span class="c">// still within the tolerated number of evictions</span>
			<span class="p">}</span>
			<span class="k">if</span> <span class="n">singleGroupStats</span><span class="o">.</span><span class="n">running</span> <span class="o">==</span> <span class="n">singleGroupStats</span><span class="o">.</span><span class="n">configured</span> <span class="o">&amp;&amp;</span>
				<span class="n">singleGroupStats</span><span class="o">.</span><span class="n">evictionTolerance</span> <span class="o">==</span> <span class="m">0</span> <span class="o">&amp;&amp;</span>
				<span class="n">singleGroupStats</span><span class="o">.</span><span class="n">evicted</span> <span class="o">==</span> <span class="m">0</span> <span class="p">{</span>
				<span class="k">return</span> <span class="no">true</span>  <span class="c">// all Pods are running but the tolerance is too small (0): allow exactly one eviction</span>
			<span class="p">}</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="no">false</span>
<span class="p">}</span>
</code></pre></div></div>
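
<p>The two conditions applied to a non-Pending Pod are easier to follow with numbers plugged in. The sketch below is a simplified model, not the Updater's code: the <code class="language-plaintext highlighter-rouge">groupStats</code> struct and the <code class="language-plaintext highlighter-rouge">canEvictRunning</code> function are hypothetical names that only reproduce the comparison logic of <code class="language-plaintext highlighter-rouge">CanEvict</code>.</p>

```go
package main

// groupStats is a hypothetical, simplified stand-in for the per-controller
// statistics that the Updater keeps.
type groupStats struct {
	configured        int // replicas the controller is configured to run
	running           int // replicas currently running
	evicted           int // replicas already evicted in this round
	evictionTolerance int // replicas that may be evicted in one round
}

// canEvictRunning reproduces the checks CanEvict applies to a running Pod.
func canEvictRunning(s groupStats) bool {
	// At least this many replicas must stay alive.
	shouldBeAlive := s.configured - s.evictionTolerance
	if s.running-s.evicted > shouldBeAlive {
		return true
	}
	// Special case: everything is running but the tolerance is 0,
	// so a single eviction is still allowed.
	return s.running == s.configured && s.evictionTolerance == 0 && s.evicted == 0
}
```

<p>With 4 configured replicas and a tolerance of 2, evictions are allowed until 2 replicas have been evicted. With a single replica and a tolerance of 0, exactly one eviction is allowed.</p>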

<p>The <code class="language-plaintext highlighter-rouge">Evict</code> function evicts a single Pod. It uses the Eviction API from the <code class="language-plaintext highlighter-rouge">policy/v1</code> group to send an eviction request for the target Pod.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vertical-pod-autoscaler/pkg/updater/eviction/pods_eviction_restriction.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">e</span> <span class="o">*</span><span class="n">podsEvictionRestrictionImpl</span><span class="p">)</span> <span class="n">Evict</span><span class="p">(</span><span class="n">podToEvict</span> <span class="o">*</span><span class="n">apiv1</span><span class="o">.</span><span class="n">Pod</span><span class="p">,</span> <span class="n">eventRecorder</span> <span class="n">record</span><span class="o">.</span><span class="n">EventRecorder</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="n">cr</span><span class="p">,</span> <span class="n">present</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">podToReplicaCreatorMap</span><span class="p">[</span><span class="n">getPodID</span><span class="p">(</span><span class="n">podToEvict</span><span class="p">)]</span>

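	<span class="k">if</span> <span class="o">!</span><span class="n">e</span><span class="o">.</span><span class="n">CanEvict</span><span class="p">(</span><span class="n">podToEvict</span><span class="p">)</span> <span class="p">{</span>  <span class="c">// check again whether the Pod can be evicted</span>
		<span class="k">return</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"cannot evict pod %s/%s"</span><span class="p">,</span> <span class="n">podToEvict</span><span class="o">.</span><span class="n">Namespace</span><span class="p">,</span> <span class="n">podToEvict</span><span class="o">.</span><span class="n">Name</span><span class="p">)</span>  <span class="c">// report the refusal to the caller</span>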
	<span class="p">}</span>

	<span class="n">eviction</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">policyv1</span><span class="o">.</span><span class="n">Eviction</span><span class="p">{</span>
		<span class="n">ObjectMeta</span><span class="o">:</span> <span class="n">metav1</span><span class="o">.</span><span class="n">ObjectMeta</span><span class="p">{</span>
			<span class="n">Namespace</span><span class="o">:</span> <span class="n">podToEvict</span><span class="o">.</span><span class="n">Namespace</span><span class="p">,</span>
			<span class="n">Name</span><span class="o">:</span>      <span class="n">podToEvict</span><span class="o">.</span><span class="n">Name</span><span class="p">,</span>
		<span class="p">},</span>
	<span class="p">}</span>
	<span class="n">err</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">CoreV1</span><span class="p">()</span><span class="o">.</span><span class="n">Pods</span><span class="p">(</span><span class="n">podToEvict</span><span class="o">.</span><span class="n">Namespace</span><span class="p">)</span><span class="o">.</span><span class="n">EvictV1</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">TODO</span><span class="p">(),</span> <span class="n">eviction</span><span class="p">)</span>  <span class="c">// send the eviction request</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">err</span>
	<span class="p">}</span>

	<span class="k">if</span> <span class="n">podToEvict</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">Phase</span> <span class="o">!=</span> <span class="n">apiv1</span><span class="o">.</span><span class="n">PodPending</span> <span class="p">{</span>
		<span class="n">singleGroupStats</span><span class="p">,</span> <span class="n">present</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">creatorToSingleGroupStatsMap</span><span class="p">[</span><span class="n">cr</span><span class="p">]</span>
		<span class="n">singleGroupStats</span><span class="o">.</span><span class="n">evicted</span> <span class="o">=</span> <span class="n">singleGroupStats</span><span class="o">.</span><span class="n">evicted</span> <span class="o">+</span> <span class="m">1</span>          <span class="c">// increment the eviction count</span>
		<span class="n">e</span><span class="o">.</span><span class="n">creatorToSingleGroupStatsMap</span><span class="p">[</span><span class="n">cr</span><span class="p">]</span> <span class="o">=</span> <span class="n">singleGroupStats</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="总结">Summary</h3>

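<p>Autoscaler is a collection of cluster autoscaling tools maintained by the Kubernetes community, and VPA is only one of its modules. The VPA implementations of many public clouds, GKE among them, are similar to the Autoscaler one. Compared with Autoscaler, however, GKE adds a few improvements:</p>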

<ul>
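  <li>When computing recommendations, it also takes the maximum node size and resource quotas into account</li>
  <li>VPA notifies the Cluster Autoscaler so that it can adjust the cluster capacity</li>
  <li>VPA runs as a control plane process rather than as Deployments on worker nodes</li>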
</ul>

<p>Autoscaler's VPA works by evicting and recreating Pods, so it is not a good fit for workloads that are sensitive to eviction. Such scenarios call for a technique that can update Pod resources in place.</p>

<h2 id="资源原地更新">In-place resource updates</h2>

<blockquote>
  <p>The code discussed in this section is based on Kubernetes HEAD <a href="https://github.com/kubernetes/kubernetes/commit/4c18d40af128ff4504e89ffd273a2b62fcdbd2f5">4c18d40</a> and containerd HEAD <a href="https://github.com/containerd/containerd/commit/03e4f1e3637ef7c0c33bdcb71642c02afa4f1298">03e4f1e</a>.</p>
</blockquote>

<p>An in-place update of Pod resources means changing the request and limit values of a Pod's resources without recreating the Pod. In K8s, this feature was introduced by <a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources">KEP-1287</a> and implemented in PR <a href="https://github.com/kubernetes/kubernetes/pull/102884">#102884</a>. The overall flow is roughly as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   +-----------+                   +-----------+                  +-----------+
   |           |                   |           |                  |           |
   | apiserver |                   |  kubelet  |                  |  runtime  |
   |           |                   |           |                  |           |
   +-----+-----+                   +-----+-----+                  +-----+-----+
         |                               |                              |
         |       watch (pod update)      |                              |
         |------------------------------&gt;|                              |
         |     [Containers.Resources]    |                              |
         |                               |                              |
         |                            (admit)                           |
         |                               |                              |
         |                               |  UpdateContainerResources()  |
         |                               |-----------------------------&gt;|
         |                               |                         (set limits)
         |                               |&lt;- - - - - - - - - - - - - - -|
         |                               |                              |
         |                               |      ContainerStatus()       |
         |                               |-----------------------------&gt;|
         |                               |                              |
         |                               |     [ContainerResources]     |
         |                               |&lt;- - - - - - - - - - - - - - -|
         |                               |                              |
         |      update (pod status)      |                              |
         |&lt;------------------------------|                              |
         | [ContainerStatuses.Resources] |                              |
         |                               |                              |
</code></pre></div></div>

<p>In K8s, for a newly created Pod, the <code class="language-plaintext highlighter-rouge">Pod.Spec.Containers[i].AllocatedResources</code> field is set by the api-server to match the resources each container requests in <code class="language-plaintext highlighter-rouge">Pod.Spec.Containers[i].Resources.Requests</code>. When the kubelet is about to create a Pod, it uses the Pod's <code class="language-plaintext highlighter-rouge">AllocatedResources</code> field to decide whether the node still has room for it.</p>

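<p>When a Pod is resized, the kubelet tries to update the resources allocated to its containers. It first checks whether the new desired resources exceed the node's allocatable resources; if they do, it returns the <code class="language-plaintext highlighter-rouge">Infeasible</code> status. If the request fits the node's allocatable resources but the Pod cannot currently be admitted alongside the other active Pods, it returns <code class="language-plaintext highlighter-rouge">Deferred</code>. Otherwise it returns <code class="language-plaintext highlighter-rouge">InProgress</code>.</p>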

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// kubernetes/pkg/kubelet/kubelet.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">kl</span> <span class="o">*</span><span class="n">Kubelet</span><span class="p">)</span> <span class="n">canResizePod</span><span class="p">(</span><span class="n">pod</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">)</span> <span class="p">(</span><span class="kt">bool</span><span class="p">,</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">,</span> <span class="n">v1</span><span class="o">.</span><span class="n">PodResizeStatus</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">var</span> <span class="n">otherActivePods</span> <span class="p">[]</span><span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span>
	<span class="n">node</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">kl</span><span class="o">.</span><span class="n">getNodeAnyWay</span><span class="p">()</span>

	<span class="n">podCopy</span> <span class="o">:=</span> <span class="n">pod</span><span class="o">.</span><span class="n">DeepCopy</span><span class="p">()</span>
	<span class="n">cpuAvailable</span> <span class="o">:=</span> <span class="n">node</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">Allocatable</span><span class="o">.</span><span class="n">Cpu</span><span class="p">()</span><span class="o">.</span><span class="n">MilliValue</span><span class="p">()</span>
	<span class="n">memAvailable</span> <span class="o">:=</span> <span class="n">node</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">Allocatable</span><span class="o">.</span><span class="n">Memory</span><span class="p">()</span><span class="o">.</span><span class="n">Value</span><span class="p">()</span>
	<span class="n">cpuRequests</span> <span class="o">:=</span> <span class="n">resource</span><span class="o">.</span><span class="n">GetResourceRequest</span><span class="p">(</span><span class="n">podCopy</span><span class="p">,</span> <span class="n">v1</span><span class="o">.</span><span class="n">ResourceCPU</span><span class="p">)</span>
	<span class="n">memRequests</span> <span class="o">:=</span> <span class="n">resource</span><span class="o">.</span><span class="n">GetResourceRequest</span><span class="p">(</span><span class="n">podCopy</span><span class="p">,</span> <span class="n">v1</span><span class="o">.</span><span class="n">ResourceMemory</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">cpuRequests</span> <span class="o">&gt;</span> <span class="n">cpuAvailable</span> <span class="o">||</span> <span class="n">memRequests</span> <span class="o">&gt;</span> <span class="n">memAvailable</span> <span class="p">{</span>
		<span class="k">return</span> <span class="no">false</span><span class="p">,</span> <span class="n">podCopy</span><span class="p">,</span> <span class="n">v1</span><span class="o">.</span><span class="n">PodResizeStatusInfeasible</span>
	<span class="p">}</span>

	<span class="n">activePods</span> <span class="o">:=</span> <span class="n">kl</span><span class="o">.</span><span class="n">GetActivePods</span><span class="p">()</span>  <span class="c">// Pods in a terminal state count as inactive</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">p</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">activePods</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">p</span><span class="o">.</span><span class="n">UID</span> <span class="o">!=</span> <span class="n">pod</span><span class="o">.</span><span class="n">UID</span> <span class="p">{</span>
			<span class="n">otherActivePods</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">otherActivePods</span><span class="p">,</span> <span class="n">p</span><span class="p">)</span>  <span class="c">// collect the other active Pods, excluding the Pod being resized</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="k">if</span> <span class="n">ok</span><span class="p">,</span> <span class="n">failReason</span><span class="p">,</span> <span class="n">failMessage</span> <span class="o">:=</span> <span class="n">kl</span><span class="o">.</span><span class="n">canAdmitPod</span><span class="p">(</span><span class="n">otherActivePods</span><span class="p">,</span> <span class="n">podCopy</span><span class="p">);</span> <span class="o">!</span><span class="n">ok</span> <span class="p">{</span>
		<span class="k">return</span> <span class="no">false</span><span class="p">,</span> <span class="n">podCopy</span><span class="p">,</span> <span class="n">v1</span><span class="o">.</span><span class="n">PodResizeStatusDeferred</span>
	<span class="p">}</span>

	<span class="c">// ...</span>
	<span class="k">return</span> <span class="no">true</span><span class="p">,</span> <span class="n">podCopy</span><span class="p">,</span> <span class="n">v1</span><span class="o">.</span><span class="n">PodResizeStatusInProgress</span>
<span class="p">}</span>
</code></pre></div></div>
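
<p>The three outcomes can be summarized with a small decision function. This is a minimal sketch under a strong assumption: admission is reduced to a single numeric resource check, whereas the kubelet's <code class="language-plaintext highlighter-rouge">canAdmitPod</code> runs its full set of admit handlers. The names <code class="language-plaintext highlighter-rouge">resizeStatus</code> and <code class="language-plaintext highlighter-rouge">decideResize</code> are hypothetical and do not exist in the kubelet.</p>

```go
package main

// resizeStatus is a hypothetical stand-in for v1.PodResizeStatus.
type resizeStatus string

const (
	statusInfeasible resizeStatus = "Infeasible"
	statusDeferred   resizeStatus = "Deferred"
	statusInProgress resizeStatus = "InProgress"
)

// decideResize models the kubelet's decision for a single resource.
// All values use the same unit (for example, CPU in milli-cores).
func decideResize(requested, allocatable, usedByOthers int64) resizeStatus {
	if requested > allocatable {
		// The request can never fit on this node.
		return statusInfeasible
	}
	if requested > allocatable-usedByOthers {
		// The request fits the node but not next to the other active Pods
		// right now, so the resize may be retried later.
		return statusDeferred
	}
	return statusInProgress
}
```

<p>On a node with 4000m of allocatable CPU where other Pods use 2000m, a resize to 5000m is <code class="language-plaintext highlighter-rouge">Infeasible</code>, a resize to 3000m is <code class="language-plaintext highlighter-rouge">Deferred</code>, and a resize to 1000m is <code class="language-plaintext highlighter-rouge">InProgress</code>.</p>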

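<p>The kubelet updates a container's CPU and memory limits by calling the <code class="language-plaintext highlighter-rouge">UpdateContainerResources</code> API of the CRI ContainerManager. The containerd implementation of this API is shown below. It first passes the requested resources through the <code class="language-plaintext highlighter-rouge">UpdateContainerResources</code> API provided by NRI, and then applies the result in <code class="language-plaintext highlighter-rouge">updateContainerResources</code>.</p>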

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// containerd/pkg/cri/server/container_update_resources.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">criService</span><span class="p">)</span> <span class="n">UpdateContainerResources</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">runtime</span><span class="o">.</span><span class="n">UpdateContainerResourcesRequest</span><span class="p">)</span> <span class="p">(</span><span class="n">retRes</span> <span class="o">*</span><span class="n">runtime</span><span class="o">.</span><span class="n">UpdateContainerResourcesResponse</span><span class="p">,</span> <span class="n">retErr</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">container</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">containerStore</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">GetContainerId</span><span class="p">())</span>  <span class="c">// get the target container</span>
	<span class="n">sandbox</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">sandboxStore</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="n">container</span><span class="o">.</span><span class="n">SandboxID</span><span class="p">)</span>     <span class="c">// get the sandbox the container runs in</span>

	<span class="n">resources</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">GetLinux</span><span class="p">()</span>
	<span class="n">updated</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">nri</span><span class="o">.</span><span class="n">UpdateContainerResources</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">sandbox</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">container</span><span class="p">,</span> <span class="n">resources</span><span class="p">)</span> <span class="c">// pass the requested resources through NRI, which may adjust them</span>
	<span class="k">if</span> <span class="n">updated</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="o">*</span><span class="n">resources</span> <span class="o">=</span> <span class="o">*</span><span class="n">updated</span>
	<span class="p">}</span>

	<span class="n">err</span> <span class="o">=</span> <span class="n">container</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">UpdateSync</span><span class="p">(</span><span class="k">func</span><span class="p">(</span><span class="n">status</span> <span class="n">containerstore</span><span class="o">.</span><span class="n">Status</span><span class="p">)</span> <span class="p">(</span><span class="n">containerstore</span><span class="o">.</span><span class="n">Status</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>  <span class="c">// apply the update while holding the container status</span>
		<span class="k">return</span> <span class="n">c</span><span class="o">.</span><span class="n">updateContainerResources</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">container</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">status</span><span class="p">)</span>
	<span class="p">})</span>

	<span class="k">return</span> <span class="o">&amp;</span><span class="n">runtime</span><span class="o">.</span><span class="n">UpdateContainerResourcesResponse</span><span class="p">{},</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="热迁移与-vpa">Live migration and VPA</h2>

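<p>A talk at KubeCon 2023 in Shanghai, <a href="https://sched.co/1RT6O">"Container Live Migration in Kubernetes Production Environments"</a>, also raised a pain point that VPA currently faces: <strong>once the current node runs short of resources, it can no longer support vertical scaling of a Pod</strong>. A good solution to this problem is live migration of containers.</p>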

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-09-30/live-migration.png" alt="live-migration" /></p>

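<p>According to the talk, live migration of containers is a form of rescheduling, but unlike ordinary rescheduling it requires the container to keep the state it had before it was moved, such as user data and container state. In cloud native scenarios it should provide the following core capabilities:</p>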

<ul>
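  <li><strong>Basic rescheduling</strong>: a workload can be live-migrated from one node to another</li>
  <li><strong>Topology optimization</strong>: use live migration to optimize the topology according to where workloads actually run, instead of relying only on up-front planning and prediction (the comparison here is with the K8s scheduler: scheduling is a one-off operation, whereas cluster resources change all the time, so the ability to adjust or migrate workloads as resources change becomes especially important)</li>
  <li><strong>Resource defragmentation</strong>: dynamically rearrange cluster resources to fit different resource requests, so that nodes do not each end up with only part of their resources requested, which creates fragmentation</li>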
</ul>

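<p>The last capability means that <strong>when VPA can no longer request resources because the node is short of them, live migration can free up resources on the node so that VPA can proceed</strong>.</p>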

<h2 id="reference">Reference</h2>

<ol>
  <li><a href="https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler">https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler</a></li>
  <li><a href="https://github.com/kubernetes/design-proposals-archive/blob/main/autoscaling/vertical-pod-autoscaler.md">https://github.com/kubernetes/design-proposals-archive/blob/main/autoscaling/vertical-pod-autoscaler.md</a></li>
  <li><a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources">https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources</a></li>
  <li><a href="https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler">https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler</a></li>
  <li><a href="https://static.sched.com/hosted_files/kccncosschn2023/d1/live%20migration-eng.pdf">https://static.sched.com/hosted_files/kccncosschn2023/d1/live%20migration-eng.pdf</a></li>
</ol>]]></content><author><name>Your Name</name><email>shawnhxh@outlook.com</email></author><category term="post" /><category term="Kubernetes" /><summary type="html"><![CDATA[Pod 自动垂直伸缩（Vertical Pod Autoscaler，VPA）是 K8s 中集群资源控制的重要一部分。它主要有两个目的： 通过自动化配置所需资源的方式来降低集群的维护成本 提升集群资源的利用率，减少集群中容器发生 OOM 或 CPU 饥饿的风险 本文以 VPA 为切入点，分析了 Autoscaler 和 Kubernetes In-Place 的 VPA 实现方式。 Autoscaler 此部分内容对应的代码基于 Autoscaler HEAD fbe25e1。 Autoscaler 的 VPA 会根据 Pod 的真实用量来自动的调整 Pod 所需的资源值，它通过引入 VerticalPodAutoscaler API 资源来实现，该资源定义了匹配哪些 Pod（label selector）使用何种更新策略（update policy）去更新以何种方式（resources policy）计算的资源值。 Autoscaler 的 VPA 由以下模块配合实现： Recommender，负责计算一个 VPA 对象中所匹配 Pod 的资源推荐值 Admission Controller，负责拦截所有 Pod 的创建请求，并覆盖匹配到 VPA 对象的 Pod 资源值字段 Updater，负责 Pod 资源的实时更新]]></summary></entry><entry><title type="html">GreptimeDB 的 KubeBlocks 集成经验分享</title><link href="https://shawnh2.github.io/post/2023/08/28/greptimedb-x-kubeblocks.html" rel="alternate" type="text/html" title="GreptimeDB 的 KubeBlocks 集成经验分享" /><published>2023-08-28T00:00:00+08:00</published><updated>2023-08-28T00:00:00+08:00</updated><id>https://shawnh2.github.io/post/2023/08/28/greptimedb-x-kubeblocks</id><content type="html" xml:base="https://shawnh2.github.io/post/2023/08/28/greptimedb-x-kubeblocks.html"><![CDATA[<blockquote>
  <p>This article is also published as:</p>
  <ul>
    <li>Greptime official WeChat account post: <a href="https://mp.weixin.qq.com/s/sIaJ6Ysp53wQzwwPJk9LuQ">GreptimeDB 的 KubeBlocks 集成经验分享</a></li>
    <li>Greptime Official Blogs: <a href="https://greptime.com/blogs/2023-09-06-greptime-with-cubeblocks">Hands-on Experience of Integrating GreptimeDB with KubeBlocks</a></li>
  </ul>
</blockquote>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-08-28/coverimage.png" alt="kb-banner" /></p>

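<h2 id="kubeblocks-是什么">What is KubeBlocks</h2>
<p><a href="https://github.com/apecloud/kubeblocks">KubeBlocks</a> is a cloud native data infrastructure open-sourced by <a href="https://kubeblocks.io/">ApeCloud</a>. It aims to help application developers and platform engineers better manage databases and various analytical workloads on Kubernetes. KubeBlocks supports multiple cloud providers and offers a declarative, unified way to improve DevOps efficiency.</p>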

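<p>KubeBlocks currently supports many kinds of data infrastructure, including relational databases, NoSQL databases, vector databases, time-series databases, graph databases, and stream processing systems.</p>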

<!--more-->

<p>The name KubeBlocks comes from Kubernetes (K8s) and LEGO blocks. The project aims to make managing data infrastructure on K8s as efficient and fun as building with LEGO.</p>

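<h2 id="为什么集成-kubeblocks">Why integrate with KubeBlocks</h2>
<p>Building data infrastructure on K8s is becoming increasingly popular. The thorniest obstacles along the way, however, are <strong>the difficulty of integrating with cloud providers, the lack of reliable Operators, and the steep K8s learning curve</strong>.</p>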

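<p>KubeBlocks offers an open source option. It helps application developers and platform engineers equip all kinds of data infrastructure with richer features and services, and it helps people who are not K8s experts quickly build full-stack, production-grade data infrastructure.</p>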

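<p>By integrating with KubeBlocks, GreptimeDB not only gains a more convenient and faster way to deploy clusters, but also benefits from the cluster management capabilities KubeBlocks provides, such as scaling, monitoring, and backup and restore. Why not?</p>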

<h2 id="kubeblocks-集成思路">How the KubeBlocks integration works</h2>
<p>KubeBlocks divides the information a cluster needs into three categories:</p>

<ul>
  <li>Topology information: the <a href="https://kubeblocks.io/docs/preview/user_docs/api-reference/cluster#apps.kubeblocks.io/v1alpha1.ClusterDefinition">ClusterDefinition</a> resource object, which defines the components a cluster needs and how each of them is deployed</li>
  <li>Version information: the <a href="https://kubeblocks.io/docs/preview/user_docs/api-reference/cluster#apps.kubeblocks.io/v1alpha1.ClusterVersion">ClusterVersion</a> resource object, which defines the image version of each component and the related configuration</li>
  <li>Resource information: the <a href="https://kubeblocks.io/docs/preview/user_docs/api-reference/cluster#apps.kubeblocks.io/v1alpha1.Cluster">Cluster</a> resource object, which defines resources such as CPU, memory, disk, and replica count</li>
</ul>

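<p>KubeBlocks decouples a cluster's topology, version, and resources, so that each object describes clearer and more focused information, and combining these objects can produce a richer variety of clusters.
The figure below shows how the three kinds of objects relate to one another in a cluster they describe.</p>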

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-08-28/kubeblocks.png" alt="kubeblocks" /></p>

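<p>Here, ComponentDef defines the deployment information of a component in a cluster, while ComponentDefRef describes a reference to a component definition. Within such a reference, you can define the object information related to that component (for example, the <code class="language-plaintext highlighter-rouge">ComponentDefRef: A</code> of a ClusterVersion sets the image version of component A to latest, and the <code class="language-plaintext highlighter-rouge">ComponentDefRef: A</code> of a Cluster sets the replica count of component A to 3).</p>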

<p>In short, integrating with KubeBlocks essentially means <strong>declaring the information that describes a cluster's topology, version, and resources</strong>.</p>

<h2 id="greptimedb-集群架构简介">GreptimeDB cluster architecture at a glance</h2>

<p>A GreptimeDB cluster consists of three components: meta, frontend, and datanode, as shown in the figure below.</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-08-28/greptimedb-cluster-architecture.png" alt="greptimedb-cluster-architecture" /></p>

<p>Among them:</p>
<ul>
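  <li>frontend exposes read and write interfaces for different protocols and forwards requests to datanode; it is a stateless component</li>
  <li>datanode handles persistent storage of data; it is a stateful component</li>
  <li>meta coordinates frontend and datanode; it is a stateless component. This article assumes that meta uses etcd as its kv-store</li>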
</ul>

<h2 id="集成经验分享">Lessons from the integration</h2>
<p>For the complete integration of GreptimeDB with KubeBlocks and how to run it, see the following PRs:</p>

<ul>
  <li><a href="https://github.com/apecloud/kubeblocks/pull/4822">https://github.com/apecloud/kubeblocks/pull/4822</a></li>
  <li><a href="https://github.com/apecloud/kubeblocks/pull/4855">https://github.com/apecloud/kubeblocks/pull/4855</a></li>
</ul>

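<p>This article does not go through the detailed configuration. Instead, it shares a few lessons learned during the integration, in the hope that they are useful to readers.</p>
<h3 id="跨组件的值引用">Cross-component value references</h3>
<p>In a cluster, one component sometimes needs to reference a value from another component. In a GreptimeDB cluster, for example, the frontend component references the Service addresses of the meta and datanode components.</p>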

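<p>KubeBlocks provides a <a href="https://kubeblocks.io/docs/release-0.6/user_docs/api-reference/cluster#apps.kubeblocks.io/v1alpha1.ComponentDefRef">componentDefRef field</a> that makes cross-component value references possible. As the configuration below shows, the frontend component declares a reference named <code class="language-plaintext highlighter-rouge">metaRef</code>, which refers to the name of the Service created by the meta component and stores that name in the <code class="language-plaintext highlighter-rouge">GREPTIMEDB_META_SVC</code> environment variable. The variable is then available to the frontend component and to any other component that declares the same reference.</p>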

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">componentDefs</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">frontend</span>
    <span class="na">componentDefRef</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="nl">&amp;metaRef</span>
        <span class="na">componentDefName</span><span class="pi">:</span> <span class="s">meta</span>
        <span class="na">componentRefEnv</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">GREPTIMEDB_META_SVC</span>
            <span class="na">valueFrom</span><span class="pi">:</span>
              <span class="na">type</span><span class="pi">:</span> <span class="s">ServiceRef</span>
    <span class="c1"># ...</span>
    <span class="na">podSpec</span><span class="pi">:</span>
      <span class="na">containers</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">frontend</span>
          <span class="na">args</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="s">--metasrv-addr</span>
            <span class="pi">-</span> <span class="s">$(GREPTIMEDB_META_SVC).$(KB_NAMESPACE).svc:3002</span>
            <span class="c1"># ...</span>
  
  <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">datanode</span>
    <span class="na">componentDefRef</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="nv">*metaRef</span>
    <span class="na">podSpec</span><span class="pi">:</span>
      <span class="na">containers</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">datanode</span>
          <span class="na">args</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="s">--metasrv-addr</span>
            <span class="pi">-</span> <span class="s">$(GREPTIMEDB_META_SVC).$(KB_NAMESPACE).svc:3002</span>
            <span class="c1"># ...</span>
</code></pre></div></div>

<p>Besides references to a Service, KubeBlocks also supports references to <a href="https://kubeblocks.io/docs/release-0.6/user_docs/api-reference/cluster#apps.kubeblocks.io/v1alpha1.ComponentValueFromType">a field of the component Spec or a Headless Service</a>.</p>
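<p>Kubernetes expands <code class="language-plaintext highlighter-rouge">$(VAR)</code> references in a container's <code class="language-plaintext highlighter-rouge">args</code> from the container's environment, which is why the arguments above can use <code class="language-plaintext highlighter-rouge">$(GREPTIMEDB_META_SVC)</code>. The Go sketch below imitates that substitution for a single argument. The helper <code class="language-plaintext highlighter-rouge">expandArg</code> and the sample values are hypothetical and only illustrate the idea; the <code class="language-plaintext highlighter-rouge">$$(VAR)</code> escape that Kubernetes also supports is left out.</p>

```go
package main

import (
	"fmt"
	"regexp"
)

// varRef matches Kubernetes-style $(NAME) references (simplified name syntax).
var varRef = regexp.MustCompile(`\$\(([A-Za-z_][A-Za-z0-9_]*)\)`)

// expandArg replaces every $(NAME) in arg with env[NAME].
// References to unknown variables are left unchanged.
func expandArg(arg string, env map[string]string) string {
	return varRef.ReplaceAllStringFunc(arg, func(ref string) string {
		name := ref[2 : len(ref)-1] // strip the leading "$(" and the trailing ")"
		if v, ok := env[name]; ok {
			return v
		}
		return ref
	})
}

func main() {
	env := map[string]string{
		"GREPTIMEDB_META_SVC": "mycluster-meta", // hypothetical Service name
		"KB_NAMESPACE":        "default",        // hypothetical namespace
	}
	// prints mycluster-meta.default.svc:3002
	fmt.Println(expandArg("$(GREPTIMEDB_META_SVC).$(KB_NAMESPACE).svc:3002", env))
}
```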

<h3 id="组件之间的启动顺序约束">Startup Ordering Between Components</h3>

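<p>A cluster usually consists of several components, and starting one component may depend on the state of another. In a GreptimeDB cluster, counting etcd alongside the three components above, the four components must start in the order etcd, meta, datanode, frontend.</p>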

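<p>When KubeBlocks deploys a cluster, it starts all components at the same time. Because the startup order is undefined, a component that starts before one of its dependencies is running fails to start and is restarted. For example, if etcd only comes up after meta has started, meta is restarted. If the startup order is ignored, the cluster still deploys successfully in the end, but the overall deployment takes longer, and components pile up restart counts for no good reason, which is hardly elegant.</p>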

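<p>Kubernetes provides <a href="https://kubernetes.io/docs/concepts/workloads/pods/init-containers/">Init Containers</a>, so when a startup order between components is required, <code class="language-plaintext highlighter-rouge">initContainers</code> can be used to check the state of the component being depended on. In the configuration below, combined with <code class="language-plaintext highlighter-rouge">componentDefRef</code>, meta starts only after the headless Service of etcd has been created and can be resolved through DNS.</p>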

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">componentDefs</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">meta</span>
    <span class="na">componentDefRef</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="nl">&amp;etcdRef</span>
        <span class="na">componentDefName</span><span class="pi">:</span> <span class="s">etcd</span>
        <span class="na">componentRefEnv</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">GREPTIMEDB_ETCD_SVC</span>
            <span class="na">valueFrom</span><span class="pi">:</span>
              <span class="na">type</span><span class="pi">:</span> <span class="s">ServiceRef</span>
    <span class="na">podSpec</span><span class="pi">:</span>
      <span class="na">initContainers</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">wait-etcd</span>
          <span class="na">image</span><span class="pi">:</span> <span class="s">busybox:1.28</span>
          <span class="na">imagePullPolicy</span><span class="pi">:</span> 
          <span class="na">command</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="s">/bin/sh</span>
            <span class="pi">-</span> <span class="s">-c</span>
            <span class="pi">-</span> <span class="pi">|</span>
              <span class="s">until nslookup ${GREPTIMEDB_ETCD_SVC}-headless.${KB_NAMESPACE}.svc; do</span>
                <span class="s">echo "waiting for etcd"; sleep 2;</span>
              <span class="s">done;</span>
      <span class="c1"># ...</span>
</code></pre></div></div>

<h3 id="灵活的-configmap-挂载">Flexible ConfigMap Mounting</h3>

<p>In a ClusterDefinition, it is tempting to mount a ConfigMap directly into the component's containers without a second thought:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ConfigMap</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">greptimedb-meta</span>
<span class="c1"># ...</span>
<span class="nn">---</span>
<span class="c1"># ...</span>
<span class="na">componentDefs</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">meta</span>
    <span class="na">podSpec</span><span class="pi">:</span>
      <span class="na">containers</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">meta</span>
          <span class="na">volumeMounts</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="na">mountPath</span><span class="pi">:</span> <span class="s">/etc/greptimedb</span>
              <span class="na">name</span><span class="pi">:</span> <span class="s">meta-config</span>
          <span class="c1"># ...</span>
      <span class="na">volumes</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">configMap</span><span class="pi">:</span>
            <span class="na">name</span><span class="pi">:</span> <span class="s">greptimedb-meta</span>
          <span class="na">name</span><span class="pi">:</span> <span class="s">meta-config</span>
</code></pre></div></div>

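<p>This way of mounting only works when the Cluster, ClusterDefinition, and ClusterVersion objects are in the same namespace; when they are in different namespaces, the ConfigMap mount fails. The reason is that ConfigMap is a namespaced resource, and a Pod can only mount ConfigMaps from its own namespace.</p>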
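<p>The failure mode can be pictured with a toy model in Go: a ConfigMap is identified by namespace plus name, and a Pod volume only ever looks in the Pod's own namespace. The types, the function <code class="language-plaintext highlighter-rouge">resolveVolume</code>, and the namespace names are all hypothetical; this is a sketch of the lookup rule, not Kubernetes code.</p>

```go
package main

import "fmt"

// key identifies a namespaced object such as a ConfigMap.
type key struct{ namespace, name string }

// resolveVolume mimics the lookup rule for a configMap volume:
// the ConfigMap is searched only in the Pod's own namespace.
func resolveVolume(store map[key]bool, podNamespace, configMapName string) bool {
	return store[key{namespace: podNamespace, name: configMapName}]
}

func main() {
	// The ConfigMap lives in a hypothetical "kb-system" namespace.
	store := map[key]bool{
		{namespace: "kb-system", name: "greptimedb-meta"}: true,
	}
	fmt.Println(resolveVolume(store, "kb-system", "greptimedb-meta")) // true: same namespace
	fmt.Println(resolveVolume(store, "demo", "greptimedb-meta"))      // false: Pod runs elsewhere
}
```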

<p>KubeBlocks provides a <a href="https://kubeblocks.io/docs/release-0.6/user_docs/api-reference/cluster#apps.kubeblocks.io/v1alpha1.ComponentConfigSpec">ConfigSpec field</a> to solve this problem. In the configuration below, <code class="language-plaintext highlighter-rouge">templateRef</code> is the name of the referenced ConfigMap.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ConfigMap</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">greptimedb-meta</span>
<span class="c1"># ...</span>
<span class="nn">---</span>
<span class="c1"># ...</span>
<span class="na">componentDefs</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">meta</span>
    <span class="na">configSpecs</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">greptimedb-meta</span>
        <span class="na">templateRef</span><span class="pi">:</span> <span class="s">greptimedb-meta</span>
        <span class="na">volumeName</span><span class="pi">:</span> <span class="s">meta-config</span>
        <span class="na">namespace</span><span class="pi">:</span> 
    <span class="na">podSpec</span><span class="pi">:</span>
      <span class="na">containers</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">meta</span>
          <span class="na">volumeMounts</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="na">mountPath</span><span class="pi">:</span> <span class="s">/etc/greptimedb</span>
              <span class="na">name</span><span class="pi">:</span> <span class="s">meta-config</span>
            <span class="pi">-</span> 
          <span class="c1"># ...</span>
</code></pre></div></div>

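<h2 id="总结">Summary</h2>

<p>This article shared some lessons from integrating GreptimeDB with KubeBlocks, all of them real problems encountered during the integration together with their solutions.</p>

<p>So far, GreptimeDB has only integrated the deployment capability of KubeBlocks; many other features have not been integrated yet. The goal is to bring more KubeBlocks capabilities to GreptimeDB in the future.</p>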

<h2 id="references">References</h2>

<ol>
  <li><a href="https://github.com/apecloud/kubeblocks">https://github.com/apecloud/kubeblocks</a></li>
  <li><a href="https://kubeblocks.io/">https://kubeblocks.io/</a></li>
  <li><a href="https://kubeblocks.io/docs/preview/user_docs/api-reference/cluster">https://kubeblocks.io/docs/preview/user_docs/api-reference/cluster</a></li>
  <li><a href="https://kubernetes.io/docs/concepts/workloads/pods/init-containers/">https://kubernetes.io/docs/concepts/workloads/pods/init-containers/</a></li>
  <li><a href="https://docs.greptime.com/developer-guide/overview#architecture">https://docs.greptime.com/developer-guide/overview#architecture</a></li>
</ol>]]></content><author><name>Your Name</name><email>shawnhxh@outlook.com</email></author><category term="post" /><category term="Kubernetes" /><summary type="html"><![CDATA[本文同为: Greptime 官方微信公众号推文：GreptimeDB 的 KubeBlocks 集成经验分享 Greptime Official Blogs: Hands-on Experience of Integrating GreptimeDB with KubeBlocks KubeBlocks 是什么 KubeBlocks 是一款由 ApeCloud 开源的云原生数据基础设施，旨在帮助应用开发者和平台工程师在 Kubernetes 上更好地管理数据库和各种分析型工作负载。KubeBlocks 支持多个云服务商，并且提供了一套声明式、统一的方式来提升 DevOps 效率。 KubeBlocks 目前支持关系型数据库、NoSQL 数据库、向量数据库、时序数据库、图数据库以及流计算系统等多种数据基础设施。]]></summary></entry><entry><title type="html">Cilium CNI: tc ReloadDatapath 工作原理解析</title><link href="https://shawnh2.github.io/post/2023/08/09/cilium-tc-reload-datapath.html" rel="alternate" type="text/html" title="Cilium CNI: tc ReloadDatapath 工作原理解析" /><published>2023-08-09T00:00:00+08:00</published><updated>2023-08-09T00:00:00+08:00</updated><id>https://shawnh2.github.io/post/2023/08/09/cilium-tc-reload-datapath</id><content type="html" xml:base="https://shawnh2.github.io/post/2023/08/09/cilium-tc-reload-datapath.html"><![CDATA[<blockquote>
  <p>The code in this article is based on Cilium HEAD <a href="https://github.com/cilium/cilium/commit/40935318e344424be1ea96510c96427aef5134c3">4093531</a>.</p>
</blockquote>

<p>In Cilium CNI, the <code class="language-plaintext highlighter-rouge">Loader.CompileAndLoad</code> method is invoked every time a CiliumEndpoint is created. <a href="https://shawnh2.github.io/post/2023/07/18/cilium-cni-walk-through.html#compileandload">A previous article</a> mentioned that Cilium uses <code class="language-plaintext highlighter-rouge">tc</code> (traffic control) to load the compiled BPF programs into the kernel, but did not describe how the loading works or what exactly is loaded. This article takes the opportunity to find out.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/datapath/loader/loader.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">Loader</span><span class="p">)</span> <span class="n">CompileAndLoad</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ep</span> <span class="n">datapath</span><span class="o">.</span><span class="n">Endpoint</span><span class="p">,</span> <span class="n">stats</span> <span class="o">*</span><span class="n">metrics</span><span class="o">.</span><span class="n">SpanStat</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">ep</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="n">log</span><span class="o">.</span><span class="n">Fatalf</span><span class="p">(</span><span class="s">"LoadBPF() doesn't support non-endpoint load"</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="n">dirs</span> <span class="o">:=</span> <span class="n">directoryInfo</span><span class="p">{</span>
		<span class="n">Library</span><span class="o">:</span> <span class="n">option</span><span class="o">.</span><span class="n">Config</span><span class="o">.</span><span class="n">BpfDir</span><span class="p">,</span>     <span class="c">// /var/lib/cilium/bpf, holds the BPF template files</span>
		<span class="n">Runtime</span><span class="o">:</span> <span class="n">option</span><span class="o">.</span><span class="n">Config</span><span class="o">.</span><span class="n">StateDir</span><span class="p">,</span>   <span class="c">// /var/run/cilium, holds the endpoint runtime state</span>
		<span class="n">State</span><span class="o">:</span>   <span class="n">ep</span><span class="o">.</span><span class="n">StateDir</span><span class="p">(),</span>            <span class="c">// /var/run/cilium/state/{endpoint-id}</span>
		<span class="n">Output</span><span class="o">:</span>  <span class="n">ep</span><span class="o">.</span><span class="n">StateDir</span><span class="p">(),</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">l</span><span class="o">.</span><span class="n">compileAndLoad</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ep</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">dirs</span><span class="p">,</span> <span class="n">stats</span><span class="p">)</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">Loader</span><span class="p">)</span> <span class="n">compileAndLoad</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ep</span> <span class="n">datapath</span><span class="o">.</span><span class="n">Endpoint</span><span class="p">,</span> <span class="n">dirs</span> <span class="o">*</span><span class="n">directoryInfo</span><span class="p">,</span> <span class="n">stats</span> <span class="o">*</span><span class="n">metrics</span><span class="o">.</span><span class="n">SpanStat</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="c">// compile the BPF programs; stop here if compilation fails</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">compileDatapath</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">ep</span><span class="o">.</span><span class="n">IsHost</span><span class="p">(),</span> <span class="n">ep</span><span class="o">.</span><span class="n">Logger</span><span class="p">(</span><span class="n">Subsystem</span><span class="p">));</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">err</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">l</span><span class="o">.</span><span class="n">reloadDatapath</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ep</span><span class="p">,</span> <span class="n">dirs</span><span class="p">)</span>  <span class="c">// load the BPF programs</span>
<span class="p">}</span>
</code></pre></div></div>

<!--more-->

<h2 id="reload-datapath">Reload Datapath</h2>
<p>Cilium uses <code class="language-plaintext highlighter-rouge">Loader.reloadDatapath</code> to load the BPF programs:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/datapath/loader/loader.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">Loader</span><span class="p">)</span> <span class="n">reloadDatapath</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ep</span> <span class="n">datapath</span><span class="o">.</span><span class="n">Endpoint</span><span class="p">,</span> <span class="n">dirs</span> <span class="o">*</span><span class="n">directoryInfo</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="c">// replace the current BPF program</span>
	<span class="n">objPath</span> <span class="o">:=</span> <span class="n">path</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">dirs</span><span class="o">.</span><span class="n">Output</span><span class="p">,</span> <span class="s">"bpf_lxc.o"</span><span class="p">)</span>

	<span class="c">// check whether the endpoint is the host endpoint</span>
	<span class="k">if</span> <span class="n">ep</span><span class="o">.</span><span class="n">IsHost</span><span class="p">()</span> <span class="p">{</span>
		<span class="n">objPath</span> <span class="o">=</span> <span class="n">path</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">dirs</span><span class="o">.</span><span class="n">Output</span><span class="p">,</span> <span class="s">"bpf_host.o"</span><span class="p">)</span>
		<span class="n">l</span><span class="o">.</span><span class="n">reloadHostDatapath</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ep</span><span class="p">,</span> <span class="n">objPath</span><span class="p">)</span>  <span class="c">// reload the BPF programs on cilium_host</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
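		<span class="n">progs</span> <span class="o">:=</span> <span class="p">[]</span><span class="n">progDefinition</span><span class="p">{{</span><span class="n">progName</span><span class="o">:</span> <span class="s">"cil_from_container"</span><span class="p">,</span> <span class="n">direction</span><span class="o">:</span> <span class="s">"ingress"</span><span class="p">}}</span>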

		<span class="k">if</span> <span class="n">ep</span><span class="o">.</span><span class="n">RequireEgressProg</span><span class="p">()</span> <span class="p">{</span>
			<span class="n">progs</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">progs</span><span class="p">,</span> <span class="n">progDefinition</span><span class="p">{</span><span class="n">progName</span><span class="o">:</span> <span class="s">"cil_to_container"</span><span class="p">,</span> <span class="n">direction</span><span class="o">:</span> <span class="s">"egress"</span><span class="p">})</span>
		<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
			<span class="n">err</span> <span class="o">:=</span> <span class="n">RemoveTCFilters</span><span class="p">(</span><span class="n">ep</span><span class="o">.</span><span class="n">InterfaceName</span><span class="p">(),</span> <span class="n">netlink</span><span class="o">.</span><span class="n">HANDLE_MIN_EGRESS</span><span class="p">)</span>  <span class="c">// remove all filters in the egress direction of the interface</span>
		<span class="p">}</span>

		<span class="n">finalize</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">replaceDatapath</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ep</span><span class="o">.</span><span class="n">InterfaceName</span><span class="p">(),</span> <span class="n">objPath</span><span class="p">,</span> <span class="n">progs</span><span class="p">,</span> <span class="s">""</span><span class="p">)</span>  <span class="c">// reload the BPF programs on the endpoint's interface</span>
		<span class="k">defer</span> <span class="n">finalize</span><span class="p">()</span>
	<span class="p">}</span>

	<span class="k">if</span> <span class="n">ep</span><span class="o">.</span><span class="n">RequireEndpointRoute</span><span class="p">()</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">ip</span> <span class="o">:=</span> <span class="n">ep</span><span class="o">.</span><span class="n">IPv4Address</span><span class="p">();</span> <span class="n">ip</span><span class="o">.</span><span class="n">IsValid</span><span class="p">()</span> <span class="p">{</span>  <span class="c">// the endpoint's IPv4 address</span>
			<span class="n">upsertEndpointRoute</span><span class="p">(</span><span class="n">ep</span><span class="p">,</span> <span class="o">*</span><span class="n">iputil</span><span class="o">.</span><span class="n">AddrToIPNet</span><span class="p">(</span><span class="n">ip</span><span class="p">))</span>
		<span class="p">}</span>
		<span class="k">if</span> <span class="n">ip</span> <span class="o">:=</span> <span class="n">ep</span><span class="o">.</span><span class="n">IPv6Address</span><span class="p">();</span> <span class="n">ip</span><span class="o">.</span><span class="n">IsValid</span><span class="p">()</span> <span class="p">{</span>
			<span class="n">upsertEndpointRoute</span><span class="p">(</span><span class="n">ep</span><span class="p">,</span> <span class="o">*</span><span class="n">iputil</span><span class="o">.</span><span class="n">AddrToIPNet</span><span class="p">(</span><span class="n">ip</span><span class="p">))</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Depending on the kind of endpoint, reloading the BPF programs falls into two cases:</p>

<ul>
  <li>For the host endpoint, the BPF program <code class="language-plaintext highlighter-rouge">bpf_host.o</code> is reloaded on the <code class="language-plaintext highlighter-rouge">cilium_host</code> device of the node where the endpoint lives
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ tc filter show dev cilium_host ingress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_host-cilium_host direct-action not_in_hw <span class="nb">id </span>4203 tag fd128c0c744c0771 jited

~ tc filter show dev cilium_host egress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_from_host-cilium_host direct-action not_in_hw <span class="nb">id </span>4213 tag bc5f052f5017dabd jited
</code></pre></div>    </div>
  </li>
  <li>For a regular endpoint, the BPF program <code class="language-plaintext highlighter-rouge">bpf_lxc.o</code> is reloaded on the endpoint's network interface
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ tc filter show dev lxc9fc12c71903b ingress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_from_container-lxc9fc12c71903b direct-action not_in_hw <span class="nb">id </span>4931 tag 4cfba610f154c365 jited
</code></pre></div>    </div>
  </li>
</ul>
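<p>The two cases can be condensed into one small decision function. The Go sketch below is a simplified model built from the object file and program names that appear in this article; the <code class="language-plaintext highlighter-rouge">prog</code> type and the <code class="language-plaintext highlighter-rouge">plan</code> function are hypothetical and are not Cilium's API. For the host endpoint it only covers the <code class="language-plaintext highlighter-rouge">cilium_host</code> device and leaves out <code class="language-plaintext highlighter-rouge">cilium_net</code> and any additional devices.</p>

```go
package main

import (
	"fmt"
	"path"
)

// prog pairs a BPF program name with the tc direction it attaches to.
type prog struct{ name, direction string }

// plan returns the object file to load and the programs to attach,
// depending on whether the endpoint is the host endpoint.
func plan(outputDir string, isHost, requireEgress bool) (string, []prog) {
	if isHost {
		return path.Join(outputDir, "bpf_host.o"), []prog{
			{name: "cil_to_host", direction: "ingress"},
			{name: "cil_from_host", direction: "egress"},
		}
	}
	progs := []prog{{name: "cil_from_container", direction: "ingress"}}
	if requireEgress {
		progs = append(progs, prog{name: "cil_to_container", direction: "egress"})
	}
	return path.Join(outputDir, "bpf_lxc.o"), progs
}

func main() {
	// "/var/run/cilium/state/42" is a made-up endpoint state directory.
	obj, progs := plan("/var/run/cilium/state/42", false, false)
	fmt.Println(obj, progs)
}
```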

<h2 id="host-endpoint">Host Endpoint</h2>
<p>Identifying the host endpoint is simple: it is decided by labels. In Cilium, this label denotes a <strong>special reserved identity</strong>:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/endpoint/endpoint.go</span>

<span class="k">func</span> <span class="n">parseEndpoint</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">owner</span> <span class="n">regeneration</span><span class="o">.</span><span class="n">Owner</span><span class="p">,</span> <span class="o">...</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">Endpoint</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="c">// ...</span>

	<span class="c">// an endpoint that carries the "reserved:host" label is the host endpoint</span>
	<span class="n">ep</span><span class="o">.</span><span class="n">isHost</span> <span class="o">=</span> <span class="n">ep</span><span class="o">.</span><span class="n">HasLabels</span><span class="p">(</span><span class="n">labels</span><span class="o">.</span><span class="n">LabelHost</span><span class="p">)</span>

	<span class="c">// ...</span>
<span class="p">}</span>
</code></pre></div></div>
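<p>To make the check concrete, here is a deliberately simplified Go model of it. It treats a label as a <code class="language-plaintext highlighter-rouge">source:key</code> string, which is an assumption made for illustration; the <code class="language-plaintext highlighter-rouge">label</code> type, <code class="language-plaintext highlighter-rouge">parseLabel</code>, and <code class="language-plaintext highlighter-rouge">isHostEndpoint</code> are hypothetical and do not reproduce Cilium's <code class="language-plaintext highlighter-rouge">labels</code> package.</p>

```go
package main

import (
	"fmt"
	"strings"
)

// label is a simplified stand-in for a Cilium label: a source and a key.
type label struct{ source, key string }

// parseLabel splits "source:key"; a string without ":" becomes a key
// with an empty source.
func parseLabel(s string) label {
	parts := strings.SplitN(s, ":", 2)
	if len(parts) == 2 {
		return label{source: parts[0], key: parts[1]}
	}
	return label{key: s}
}

// isHostEndpoint reports whether any of the raw labels is reserved:host.
func isHostEndpoint(raw []string) bool {
	host := label{source: "reserved", key: "host"}
	for _, s := range raw {
		if parseLabel(s) == host {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isHostEndpoint([]string{"reserved:host"}))    // true
	fmt.Println(isHostEndpoint([]string{"k8s:app=frontend"})) // false
}
```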
<p>The host endpoint is a special endpoint that can be thought of as an endpoint abstracted from localhost. Its configuration shows that the host endpoint corresponds to the <code class="language-plaintext highlighter-rouge">cilium_host</code> network interface.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ kubectl <span class="nt">-n</span> kube-system <span class="nb">exec </span>cilium-k6rxc <span class="nt">--</span> cilium endpoint get <span class="nt">-l</span> reserved:host
<span class="c"># ...</span>
  <span class="s2">"networking"</span>: <span class="o">{</span>
    <span class="s2">"addressing"</span>: <span class="o">[</span>
      <span class="o">{}</span>
    <span class="o">]</span>,
    <span class="s2">"host-mac"</span>: <span class="s2">"be:00:72:df:07:5a"</span>,
    <span class="s2">"interface-name"</span>: <span class="s2">"cilium_host"</span>,  <span class="c"># interface name</span>
    <span class="s2">"mac"</span>: <span class="s2">"be:00:72:df:07:5a"</span>        <span class="c"># interface MAC address</span>
  <span class="o">}</span>,
<span class="c"># ...</span>
</code></pre></div></div>
<p>In fact, the IP address of the <code class="language-plaintext highlighter-rouge">cilium_host</code> interface is the <a href="https://shawnh2.github.io/post/2023/07/18/cilium-cni-walk-through.html#cilium-internal-ip">Cilium Internal IP</a>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ ip addr
<span class="c"># ...</span>
5: cilium_host@cilium_net: &lt;BROADCAST,MULTICAST,NOARP,UP,LOWER_UP&gt; mtu 65535 qdisc noqueue state UP group default qlen 1000
    <span class="nb">link</span>/ether be:00:72:df:07:5a brd ff:ff:ff:ff:ff:ff
    inet 10.244.2.110/32 scope global cilium_host
<span class="c"># ...</span>

~ kubectl get cn kind-worker
NAME                 CILIUMINTERNALIP   INTERNALIP   AGE
kind-worker          10.244.2.110       172.19.0.4   17h
</code></pre></div></div>
<p>Note that the host's root network namespace contains four virtual network interfaces:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">cilium_vxlan</code>, which encapsulates and decapsulates packets for VXLAN</li>
  <li><code class="language-plaintext highlighter-rouge">cilium_host</code> and <code class="language-plaintext highlighter-rouge">cilium_net</code>, which together form a veth pair
    <ul>
      <li><code class="language-plaintext highlighter-rouge">cilium_host</code> serves as the gateway of the node's cluster subnet, because in the <a href="https://shawnh2.github.io/post/2023/07/18/cilium-cni-walk-through.html#endpoint-%E8%B7%AF%E7%94%B1%E7%94%9F%E6%88%90">routes generated for an endpoint</a> the Cilium Internal IP acts as the endpoint's default gateway</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">lxc_health</code>, which is used for health checks between endpoints</li>
</ul>

<h3 id="reloadhostdatapath">reloadHostDatapath</h3>
<p>For the host endpoint, <code class="language-plaintext highlighter-rouge">reloadHostDatapath</code> first prepares all BPF programs that need to be loaded and then calls <code class="language-plaintext highlighter-rouge">replaceDatapath</code> to reload them. <code class="language-plaintext highlighter-rouge">replaceDatapath</code> is analyzed in a later section.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/datapath/loader/loader.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">Loader</span><span class="p">)</span> <span class="n">reloadHostDatapath</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ep</span> <span class="n">datapath</span><span class="o">.</span><span class="n">Endpoint</span><span class="p">,</span> <span class="n">objPath</span> <span class="kt">string</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="n">nbInterfaces</span> <span class="o">:=</span> <span class="nb">len</span><span class="p">(</span><span class="n">option</span><span class="o">.</span><span class="n">Config</span><span class="o">.</span><span class="n">GetDevices</span><span class="p">())</span> <span class="o">+</span> <span class="m">2</span>  <span class="c">// default: 2</span>
	<span class="n">symbols</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="kt">string</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="n">nbInterfaces</span><span class="p">)</span>
	<span class="n">directions</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="kt">string</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="n">nbInterfaces</span><span class="p">)</span>
	<span class="n">objPaths</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="kt">string</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="n">nbInterfaces</span><span class="p">)</span>
	<span class="n">interfaceNames</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="kt">string</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="n">nbInterfaces</span><span class="p">)</span>
	<span class="n">symbols</span><span class="p">[</span><span class="m">0</span><span class="p">],</span> <span class="n">symbols</span><span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="o">=</span> <span class="s">"cil_to_host"</span><span class="p">,</span> <span class="s">"cil_from_host"</span>
	<span class="n">directions</span><span class="p">[</span><span class="m">0</span><span class="p">],</span> <span class="n">directions</span><span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="o">=</span> <span class="s">"ingress"</span><span class="p">,</span> <span class="s">"egress"</span>
	<span class="n">objPaths</span><span class="p">[</span><span class="m">0</span><span class="p">],</span> <span class="n">objPaths</span><span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">objPath</span><span class="p">,</span> <span class="n">objPath</span>
	<span class="n">interfaceNames</span><span class="p">[</span><span class="m">0</span><span class="p">],</span> <span class="n">interfaceNames</span><span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">ep</span><span class="o">.</span><span class="n">InterfaceName</span><span class="p">(),</span> <span class="n">ep</span><span class="o">.</span><span class="n">InterfaceName</span><span class="p">()</span>

	<span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">netlink</span><span class="o">.</span><span class="n">LinkByName</span><span class="p">(</span><span class="s">"cilium_net"</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">err</span>  <span class="c">// cilium_net and cilium_host come as a pair; return the error if the peer interface does not exist</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="c">// cilium_net only needs a BPF program in the ingress direction</span>
		<span class="n">interfaceNames</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">interfaceNames</span><span class="p">,</span> <span class="s">"cilium_net"</span><span class="p">)</span>
		<span class="n">symbols</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">symbols</span><span class="p">,</span> <span class="s">"cil_to_host"</span><span class="p">)</span>
		<span class="n">directions</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">directions</span><span class="p">,</span> <span class="s">"ingress"</span><span class="p">)</span>
		<span class="n">secondDevObjPath</span> <span class="o">:=</span> <span class="n">path</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">ep</span><span class="o">.</span><span class="n">StateDir</span><span class="p">(),</span> <span class="s">"bpf_host_cilium_net.o"</span><span class="p">)</span>
		<span class="n">err</span> <span class="o">:=</span> <span class="n">patchHostNetdevDatapath</span><span class="p">(</span><span class="n">ep</span><span class="p">,</span> <span class="n">objPath</span><span class="p">,</span> <span class="n">secondDevObjPath</span><span class="p">,</span> <span class="s">"cilium_net"</span><span class="p">,</span> <span class="no">nil</span><span class="p">)</span>  <span class="c">// fill in interface-specific information</span>
		<span class="n">objPaths</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">objPaths</span><span class="p">,</span> <span class="n">secondDevObjPath</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="n">bpfMasqIPv4Addrs</span> <span class="o">:=</span> <span class="n">node</span><span class="o">.</span><span class="n">GetMasqIPv4AddrsWithDevices</span><span class="p">()</span>

	<span class="c">// This option is empty by default, so the loop body usually does not run</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">device</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">option</span><span class="o">.</span><span class="n">Config</span><span class="o">.</span><span class="n">GetDevices</span><span class="p">()</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">netlink</span><span class="o">.</span><span class="n">LinkByName</span><span class="p">(</span><span class="n">device</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>

		<span class="n">netdevObjPath</span> <span class="o">:=</span> <span class="n">path</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">ep</span><span class="o">.</span><span class="n">StateDir</span><span class="p">(),</span> <span class="s">"bpf_netdev_"</span><span class="o">+</span><span class="n">device</span><span class="o">+</span><span class="s">".o"</span><span class="p">)</span>
		<span class="n">err</span> <span class="o">:=</span> <span class="n">patchHostNetdevDatapath</span><span class="p">(</span><span class="n">ep</span><span class="p">,</span> <span class="n">objPath</span><span class="p">,</span> <span class="n">netdevObjPath</span><span class="p">,</span> <span class="n">device</span><span class="p">,</span> <span class="n">bpfMasqIPv4Addrs</span><span class="p">)</span>
		<span class="n">objPaths</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">objPaths</span><span class="p">,</span> <span class="n">netdevObjPath</span><span class="p">)</span>
		<span class="n">interfaceNames</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">interfaceNames</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>
		<span class="n">symbols</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">symbols</span><span class="p">,</span> <span class="s">"cil_from_netdev"</span><span class="p">)</span>
		<span class="n">directions</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">directions</span><span class="p">,</span> <span class="s">"ingress"</span><span class="p">)</span>

		<span class="c">// ... decide whether cil_to_netdev must also be loaded on the interface's egress side</span>
	<span class="p">}</span>

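	<span class="c">// For each interface, reload the BPF program that belongs to that interface and direction</span>
	<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">interfaceName</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">interfaceNames</span> <span class="p">{</span>
		<span class="n">symbol</span> <span class="o">:=</span> <span class="n">symbols</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
		<span class="n">progs</span> <span class="o">:=</span> <span class="p">[]</span><span class="n">progDefinition</span><span class="p">{{</span><span class="n">progName</span><span class="o">:</span> <span class="n">symbol</span><span class="p">,</span> <span class="n">direction</span><span class="o">:</span> <span class="n">directions</span><span class="p">[</span><span class="n">i</span><span class="p">]}}</span>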
		<span class="n">finalize</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">replaceDatapath</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">interfaceName</span><span class="p">,</span> <span class="n">objPaths</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">progs</span><span class="p">,</span> <span class="s">""</span><span class="p">)</span>  <span class="c">// ***</span>
		<span class="k">defer</span> <span class="n">finalize</span><span class="p">()</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The implementation shows that, for the host endpoint, Cilium loads BPF programs on both the ingress and egress sides of the <code class="language-plaintext highlighter-rouge">cilium_host</code> interface, and also on the ingress side of its peer, <code class="language-plaintext highlighter-rouge">cilium_net</code>. The resulting relationship between <code class="language-plaintext highlighter-rouge">cilium_host</code> and <code class="language-plaintext highlighter-rouge">cilium_net</code> is shown in the figure below:</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-08-09/cilium-host-net.png" alt="cilium-host-net" /></p>

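<p>In addition, if the user specifies bpf_host devices through <code class="language-plaintext highlighter-rouge">daemonConfig.devices</code>, Cilium loads a dedicated object named <code class="language-plaintext highlighter-rouge">bpf_netdev_${device}.o</code> for each of those devices. This feature is normally used only in cases such as the host firewall or BPF NodePort being enabled.</p>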
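<p>The attachment layout described above can be summarised in a small table. The following is a minimal sketch, not Cilium code: the <code class="language-plaintext highlighter-rouge">attachment</code> type and the <code class="language-plaintext highlighter-rouge">hostAttachments</code> helper are hypothetical, and the sketch assumes that <code class="language-plaintext highlighter-rouge">cilium_net</code> exists and that <code class="language-plaintext highlighter-rouge">cil_to_host</code> sits on ingress while <code class="language-plaintext highlighter-rouge">cil_from_host</code> sits on egress of <code class="language-plaintext highlighter-rouge">cilium_host</code>.</p>

```go
package main

import "fmt"

// attachment is a hypothetical helper type for illustration only.
// Cilium keeps these values in parallel slices instead.
type attachment struct {
	Iface     string
	Symbol    string
	Direction string
}

// hostAttachments lists what would be attached for the host endpoint,
// assuming the cilium_net peer exists. Every extra device contributes
// one ingress program, as in the loop over the configured devices.
func hostAttachments(extraDevices []string) []attachment {
	plan := []attachment{
		{"cilium_host", "cil_to_host", "ingress"},
		{"cilium_host", "cil_from_host", "egress"},
		{"cilium_net", "cil_to_host", "ingress"},
	}
	for _, dev := range extraDevices {
		plan = append(plan, attachment{dev, "cil_from_netdev", "ingress"})
	}
	return plan
}

func main() {
	// "eth0" is only an example device name.
	for _, a := range hostAttachments([]string{"eth0"}) {
		fmt.Printf("%-12s %-8s %s\n", a.Iface, a.Direction, a.Symbol)
	}
}
```

<p>With no extra devices configured, which is the default, the plan contains exactly three entries.</p>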
<h3 id="bpf-cil-to-host">bpf: cil-to-host</h3>
<p>The two BPF programs that Cilium reloads on the <code class="language-plaintext highlighter-rouge">cilium_host</code> interface are <code class="language-plaintext highlighter-rouge">cil-from-host</code> and <code class="language-plaintext highlighter-rouge">cil-to-host</code>.</p>

<p>On the ingress side, the reloaded <code class="language-plaintext highlighter-rouge">cil-to-host</code> BPF program has the following call stack (taking IPv4 as an example):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|- cil_to_host                             @ bpf/bpf_host.c
   |- ipv4_host_policy_ingress             @ bpf/lib/host_firewall.h
      |- ipv4_host_policy_ingress_lookup
      |- __ipv4_host_policy_ingress
</code></pre></div></div>
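<p><code class="language-plaintext highlighter-rouge">ipv4_host_policy_ingress_lookup</code> first uses the packet's destination address to look up the identity of the destination endpoint. Only packets destined for <code class="language-plaintext highlighter-rouge">cilium_host</code> (that is, the host endpoint, whose identity is <code class="language-plaintext highlighter-rouge">HOST_ID</code>) go on to ingress policy enforcement:</p>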
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">__always_inline</span> <span class="n">bool</span>
<span class="nf">ipv4_host_policy_ingress_lookup</span><span class="p">(</span><span class="k">struct</span> <span class="n">__ctx_buff</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">iphdr</span> <span class="o">*</span><span class="n">ip4</span><span class="p">,</span> <span class="k">struct</span> <span class="n">ct_buffer4</span> <span class="o">*</span><span class="n">ct_buffer</span><span class="p">)</span>
<span class="p">{</span>
	<span class="kt">int</span> <span class="n">l4_off</span><span class="p">,</span> <span class="n">l3_off</span> <span class="o">=</span> <span class="n">ETH_HLEN</span><span class="p">;</span>
	<span class="n">__u32</span> <span class="n">dst_sec_identity</span> <span class="o">=</span> <span class="n">WORLD_ID</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">remote_endpoint_info</span> <span class="o">*</span><span class="n">info</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">ipv4_ct_tuple</span> <span class="o">*</span><span class="n">tuple</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">ct_buffer</span><span class="o">-&gt;</span><span class="n">tuple</span><span class="p">;</span>

	<span class="cm">/* Look up the identity of the endpoint that owns the destination address */</span>
	<span class="n">info</span> <span class="o">=</span> <span class="n">lookup_ip4_remote_endpoint</span><span class="p">(</span><span class="n">ip4</span><span class="o">-&gt;</span><span class="n">daddr</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">info</span> <span class="o">&amp;&amp;</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">sec_identity</span><span class="p">)</span>
		<span class="n">dst_sec_identity</span> <span class="o">=</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">sec_identity</span><span class="p">;</span>

	<span class="cm">/* Host policy is evaluated only when the destination identity is the host */</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">dst_sec_identity</span> <span class="o">!=</span> <span class="n">HOST_ID</span><span class="p">)</span>
		<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>

	<span class="cm">/* Look up the connection in the conntrack map */</span>
	<span class="n">tuple</span><span class="o">-&gt;</span><span class="n">nexthdr</span> <span class="o">=</span> <span class="n">ip4</span><span class="o">-&gt;</span><span class="n">protocol</span><span class="p">;</span>
	<span class="n">tuple</span><span class="o">-&gt;</span><span class="n">daddr</span> <span class="o">=</span> <span class="n">ip4</span><span class="o">-&gt;</span><span class="n">daddr</span><span class="p">;</span>
	<span class="n">tuple</span><span class="o">-&gt;</span><span class="n">saddr</span> <span class="o">=</span> <span class="n">ip4</span><span class="o">-&gt;</span><span class="n">saddr</span><span class="p">;</span>
	<span class="n">l4_off</span> <span class="o">=</span> <span class="n">l3_off</span> <span class="o">+</span> <span class="n">ipv4_hdrlen</span><span class="p">(</span><span class="n">ip4</span><span class="p">);</span>
	<span class="n">ct_buffer</span><span class="o">-&gt;</span><span class="n">ret</span> <span class="o">=</span> <span class="n">ct_lookup4</span><span class="p">(</span><span class="n">get_ct_map4</span><span class="p">(</span><span class="n">tuple</span><span class="p">),</span> <span class="n">tuple</span><span class="p">,</span> <span class="n">ctx</span><span class="p">,</span> <span class="n">l4_off</span><span class="p">,</span> <span class="n">CT_INGRESS</span><span class="p">,</span>
				    <span class="o">&amp;</span><span class="n">ct_buffer</span><span class="o">-&gt;</span><span class="n">ct_state</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ct_buffer</span><span class="o">-&gt;</span><span class="n">monitor</span><span class="p">);</span>

	<span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
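<p>The identity check at the top of this function can be restated outside of BPF. The following is a minimal sketch under simplifying assumptions: the <code class="language-plaintext highlighter-rouge">ipcache</code> type is a hypothetical exact-match map (the real lookup is a prefix match in a BPF map), <code class="language-plaintext highlighter-rouge">needsHostPolicy</code> is an illustrative name, and the numeric values 1 and 2 are Cilium's reserved identities for host and world.</p>

```go
package main

import "fmt"

const (
	hostID  uint32 = 1 // reserved identity "host"
	worldID uint32 = 2 // reserved identity "world"
)

// ipcache is a simplified stand-in for the IP-to-identity map:
// it only supports exact IP matches.
type ipcache map[string]uint32

// needsHostPolicy mirrors the early exit of the lookup function:
// the destination identity defaults to world, is overridden by a
// non-zero cache entry, and host policy applies only to the host.
func needsHostPolicy(cache ipcache, dstIP string) bool {
	dstID := worldID
	if id, ok := cache[dstIP]; ok && id != 0 {
		dstID = id
	}
	return dstID == hostID
}

func main() {
	// The addresses and the identity 4321 are made-up example values.
	cache := ipcache{
		"10.0.0.1":  hostID,
		"10.0.1.23": 4321,
	}
	fmt.Println(needsHostPolicy(cache, "10.0.0.1"))  // true
	fmt.Println(needsHostPolicy(cache, "10.0.1.23")) // false
	fmt.Println(needsHostPolicy(cache, "8.8.8.8"))   // false, falls back to world
}
```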
<p>Packets whose destination endpoint is not the host return <code class="language-plaintext highlighter-rouge">CTX_ACT_OK</code> directly from <code class="language-plaintext highlighter-rouge">ipv4_host_policy_ingress</code>, without running the remaining functions. Packets that do take part in ingress policy evaluation go on to <code class="language-plaintext highlighter-rouge">__ipv4_host_policy_ingress</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">__always_inline</span> <span class="kt">int</span>
<span class="nf">__ipv4_host_policy_ingress</span><span class="p">(</span><span class="k">struct</span> <span class="n">__ctx_buff</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">iphdr</span> <span class="o">*</span><span class="n">ip4</span><span class="p">,</span>
			   <span class="k">struct</span> <span class="n">ct_buffer4</span> <span class="o">*</span><span class="n">ct_buffer</span><span class="p">,</span> <span class="n">__u32</span> <span class="o">*</span><span class="n">src_sec_identity</span><span class="p">,</span>
			   <span class="k">struct</span> <span class="n">trace_ctx</span> <span class="o">*</span><span class="n">trace</span><span class="p">,</span> <span class="n">__s8</span> <span class="o">*</span><span class="n">ext_err</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">ct_state</span> <span class="n">ct_state_new</span> <span class="o">=</span> <span class="p">{};</span>
	<span class="k">struct</span> <span class="n">ct_state</span> <span class="o">*</span><span class="n">ct_state</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">ct_buffer</span><span class="o">-&gt;</span><span class="n">ct_state</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">ipv4_ct_tuple</span> <span class="o">*</span><span class="n">tuple</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">ct_buffer</span><span class="o">-&gt;</span><span class="n">tuple</span><span class="p">;</span>
	<span class="n">__u16</span> <span class="n">node_id</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
	<span class="kt">int</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">ct_buffer</span><span class="o">-&gt;</span><span class="n">ret</span><span class="p">;</span>
	<span class="kt">int</span> <span class="n">verdict</span> <span class="o">=</span> <span class="n">CTX_ACT_OK</span><span class="p">;</span>
	<span class="n">__u8</span> <span class="n">policy_match_type</span> <span class="o">=</span> <span class="n">POLICY_MATCH_NONE</span><span class="p">;</span>
	<span class="n">__u8</span> <span class="n">audited</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">remote_endpoint_info</span> <span class="o">*</span><span class="n">info</span><span class="p">;</span>
	<span class="n">bool</span> <span class="n">is_untracked_fragment</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
	<span class="n">__u16</span> <span class="n">proxy_port</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

	<span class="cm">/* Look up the source endpoint's identity from the source IP address */</span>
	<span class="n">info</span> <span class="o">=</span> <span class="n">lookup_ip4_remote_endpoint</span><span class="p">(</span><span class="n">ip4</span><span class="o">-&gt;</span><span class="n">saddr</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">info</span> <span class="o">&amp;&amp;</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">sec_identity</span><span class="p">)</span> <span class="p">{</span>
		<span class="o">*</span><span class="n">src_sec_identity</span> <span class="o">=</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">sec_identity</span><span class="p">;</span>
		<span class="n">node_id</span> <span class="o">=</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">node_id</span><span class="p">;</span>
	<span class="p">}</span>

	<span class="cm">/* Look up the policy and decide whether the packet may enter on ingress; returns the verdict */</span>
	<span class="n">verdict</span> <span class="o">=</span> <span class="n">policy_can_access_ingress</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="o">*</span><span class="n">src_sec_identity</span><span class="p">,</span> <span class="n">HOST_ID</span><span class="p">,</span> <span class="n">tuple</span><span class="o">-&gt;</span><span class="n">dport</span><span class="p">,</span> <span class="n">tuple</span><span class="o">-&gt;</span><span class="n">nexthdr</span><span class="p">,</span>
                                        <span class="n">is_untracked_fragment</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">policy_match_type</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">audited</span><span class="p">,</span> <span class="n">ext_err</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">proxy_port</span><span class="p">);</span>

	<span class="cm">/* Create a new CT entry in the conntrack map only if the connection is accepted */</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">ret</span> <span class="o">==</span> <span class="n">CT_NEW</span> <span class="o">&amp;&amp;</span> <span class="n">verdict</span> <span class="o">==</span> <span class="n">CTX_ACT_OK</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">ct_state_new</span><span class="p">.</span><span class="n">src_sec_id</span> <span class="o">=</span> <span class="o">*</span><span class="n">src_sec_identity</span><span class="p">;</span>
		<span class="n">ct_state_new</span><span class="p">.</span><span class="n">node_port</span> <span class="o">=</span> <span class="n">ct_state</span><span class="o">-&gt;</span><span class="n">node_port</span><span class="p">;</span>
		<span class="n">ret</span> <span class="o">=</span> <span class="n">ct_create4</span><span class="p">(</span><span class="n">get_ct_map4</span><span class="p">(</span><span class="n">tuple</span><span class="p">),</span> <span class="o">&amp;</span><span class="n">CT_MAP_ANY4</span><span class="p">,</span> <span class="n">tuple</span><span class="p">,</span>
				 <span class="n">ctx</span><span class="p">,</span> <span class="n">CT_INGRESS</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ct_state_new</span><span class="p">,</span> <span class="n">proxy_port</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">false</span><span class="p">,</span> <span class="n">ext_err</span><span class="p">);</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">IS_ERR</span><span class="p">(</span><span class="n">ret</span><span class="p">))</span> <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
	<span class="p">}</span>

<span class="nl">out:</span>
	<span class="cm">/* Mark the packet as PACKET_HOST; needed for packets redirected from the lxc device to the host device */</span>
	<span class="n">ctx_change_type</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">PACKET_HOST</span><span class="p">);</span>
	<span class="k">return</span> <span class="n">verdict</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
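<p>The core of this function is <code class="language-plaintext highlighter-rouge">policy_can_access_ingress</code>, which <strong>decides whether the ingress policy allows the packet in</strong>. During policy matching, Cilium first reads the policy from a map and then matches against it. Matching is split into six precedence levels (1 is the highest and 6 the lowest, as listed in the table below). Each level is described by three match dimensions: <strong>the ID is an L3 match feature, while the protocol and the port are L4 match features</strong>. These three dimensions are exactly what Cilium's NetworkPolicy CRDs describe. Taking <code class="language-plaintext highlighter-rouge">CiliumClusterwideNetworkPolicy</code> as an example, <a href="https://doc.crds.dev/github.com/cilium/cilium/cilium.io/CiliumClusterwideNetworkPolicy/v2@v1.14.0-snapshot.4#spec-ingress">its ingress spec</a> is built entirely around these three dimensions.</p>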

<table>
  <thead>
    <tr>
      <th>Precedence</th>
      <th>Policy Match</th>
      <th>Match Type</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>id/proto/port</td>
      <td>L3/L4</td>
    </tr>
    <tr>
      <td>2</td>
      <td>ANY/proto/port</td>
      <td>L4-only</td>
    </tr>
    <tr>
      <td>3</td>
      <td>id/proto/ANY</td>
      <td>L3-proto</td>
    </tr>
    <tr>
      <td>4</td>
      <td>ANY/proto/ANY</td>
      <td>Proto-only</td>
    </tr>
    <tr>
      <td>5</td>
      <td>id/ANY/ANY</td>
      <td>L3-only</td>
    </tr>
    <tr>
      <td>6</td>
      <td>ANY/ANY/ANY</td>
      <td>All</td>
    </tr>
  </tbody>
</table>
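
<p>The precedence order in the table can be read as a sequence of lookups, from the most specific key to the least specific one. The following is a minimal sketch of that idea, not Cilium's datapath code: the <code class="language-plaintext highlighter-rouge">policyKey</code> type, the <code class="language-plaintext highlighter-rouge">lookup</code> function, the use of 0 for ANY, and the deny-on-miss default are all assumptions made for illustration.</p>

```go
package main

import "fmt"

// policyKey mirrors the three match dimensions from the table.
// The value 0 stands for ANY in each field (an assumption of this sketch).
type policyKey struct {
	ID    uint32 // L3: security identity of the peer
	Proto uint8  // L4: IP protocol number
	Port  uint16 // L4: destination port
}

// lookup tries the six key shapes in precedence order and returns the
// stored verdict together with the 1-based precedence of the first hit.
func lookup(policy map[policyKey]bool, id uint32, proto uint8, port uint16) (allow bool, precedence int) {
	candidates := []policyKey{
		{id, proto, port}, // 1: id/proto/port  (L3/L4)
		{0, proto, port},  // 2: ANY/proto/port (L4-only)
		{id, proto, 0},    // 3: id/proto/ANY   (L3-proto)
		{0, proto, 0},     // 4: ANY/proto/ANY  (Proto-only)
		{id, 0, 0},        // 5: id/ANY/ANY     (L3-only)
		{0, 0, 0},         // 6: ANY/ANY/ANY    (All)
	}
	for i, k := range candidates {
		if v, ok := policy[k]; ok {
			return v, i + 1
		}
	}
	return false, 0 // nothing matched: this sketch denies by default
}

func main() {
	// Identity 1234 and the rules below are made-up example values.
	policy := map[policyKey]bool{
		{ID: 0, Proto: 6, Port: 80}:   true,  // any peer may reach TCP/80
		{ID: 1234, Proto: 0, Port: 0}: false, // identity 1234 is denied otherwise
	}
	fmt.Println(lookup(policy, 1234, 6, 80))  // true 2
	fmt.Println(lookup(policy, 1234, 17, 53)) // false 5
	fmt.Println(lookup(policy, 42, 17, 53))   // false 0
}
```

<p>In the first call the L4-only entry wins over the L3-only entry for the same peer, because precedence 2 is tried before precedence 5.</p>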

<h2 id="endpoint">Endpoint</h2>
<p>Regardless of the endpoint type, every endpoint eventually calls the <code class="language-plaintext highlighter-rouge">replaceDatapath</code> function.</p>
<h3 id="replacedatapath">replaceDatapath</h3>
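<p>This function first parses the BPF ELF file into a CollectionSpec and loads it into the kernel. Because the maps of the CollectionSpec are pinned to a path on bpffs each time it is loaded, Cilium can reuse an already pinned map as long as its type, key/value sizes, flags and maximum number of entries are unchanged. If any of these properties change, the map on bpffs has to be migrated (<code class="language-plaintext highlighter-rouge">BPFFSMigration</code>, that is, re-pinned).</p>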
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/datapath/loader/netlink.go</span>

<span class="k">func</span> <span class="n">replaceDatapath</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ifName</span><span class="p">,</span> <span class="n">objPath</span> <span class="kt">string</span><span class="p">,</span> <span class="n">progs</span> <span class="p">[]</span><span class="n">progDefinition</span><span class="p">,</span> <span class="n">xdpMode</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="k">func</span><span class="p">(),</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>

	<span class="n">link</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">netlink</span><span class="o">.</span><span class="n">LinkByName</span><span class="p">(</span><span class="n">ifName</span><span class="p">)</span>

	<span class="c">// Load the eBPF ELF file from disk and parse it into a CollectionSpec</span>
	<span class="n">spec</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">bpf</span><span class="o">.</span><span class="n">LoadCollectionSpec</span><span class="p">(</span><span class="n">objPath</span><span class="p">)</span>

	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">prog</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">progs</span> <span class="p">{</span>
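		<span class="k">if</span> <span class="n">spec</span><span class="o">.</span><span class="n">Programs</span><span class="p">[</span><span class="n">prog</span><span class="o">.</span><span class="n">progName</span><span class="p">]</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>  <span class="c">// make sure the program to attach exists in the ELF</span>
			<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"no program %s found in eBPF ELF"</span><span class="p">,</span> <span class="n">prog</span><span class="o">.</span><span class="n">progName</span><span class="p">)</span>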
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="c">// Load the CollectionSpec into the kernel and pin its maps under TCGlobalsPath on bpffs</span>
	<span class="n">finalize</span> <span class="o">:=</span> <span class="k">func</span><span class="p">()</span> <span class="p">{}</span>
	<span class="n">opts</span> <span class="o">:=</span> <span class="n">ebpf</span><span class="o">.</span><span class="n">CollectionOptions</span><span class="p">{</span>
		<span class="n">Maps</span><span class="o">:</span> <span class="n">ebpf</span><span class="o">.</span><span class="n">MapOptions</span><span class="p">{</span><span class="n">PinPath</span><span class="o">:</span> <span class="n">bpf</span><span class="o">.</span><span class="n">TCGlobalsPath</span><span class="p">()},</span>
	<span class="p">}</span>
	<span class="n">coll</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">bpf</span><span class="o">.</span><span class="n">LoadCollection</span><span class="p">(</span><span class="n">spec</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">errors</span><span class="o">.</span><span class="n">Is</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="n">ebpf</span><span class="o">.</span><span class="n">ErrMapIncompatible</span><span class="p">)</span> <span class="p">{</span>
		<span class="c">// The maps already pinned at the path do not match the new spec, so re-pin them before loading the new spec</span>
		<span class="n">err</span> <span class="o">:=</span> <span class="n">bpf</span><span class="o">.</span><span class="n">StartBPFFSMigration</span><span class="p">(</span><span class="n">bpf</span><span class="o">.</span><span class="n">TCGlobalsPath</span><span class="p">(),</span> <span class="n">spec</span><span class="p">)</span>

		<span class="n">finalize</span> <span class="o">=</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
			<span class="n">bpf</span><span class="o">.</span><span class="n">FinalizeBPFFSMigration</span><span class="p">(</span><span class="n">bpf</span><span class="o">.</span><span class="n">TCGlobalsPath</span><span class="p">(),</span> <span class="n">spec</span><span class="p">,</span> <span class="no">false</span><span class="p">)</span>  <span class="c">// remove the old maps that were moved aside</span>
		<span class="p">}</span>

		<span class="c">// Once the migration has started, retry loading the CollectionSpec</span>
		<span class="n">coll</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">bpf</span><span class="o">.</span><span class="n">LoadCollection</span><span class="p">(</span><span class="n">spec</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="k">var</span> <span class="n">ve</span> <span class="o">*</span><span class="n">ebpf</span><span class="o">.</span><span class="n">VerifierError</span>
	<span class="k">if</span> <span class="n">errors</span><span class="o">.</span><span class="n">As</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ve</span><span class="p">)</span> <span class="p">{</span>
		<span class="c">// Verifier error</span>
	<span class="p">}</span>
	<span class="k">defer</span> <span class="n">coll</span><span class="o">.</span><span class="n">Close</span><span class="p">()</span>

	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">prog</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">progs</span> <span class="p">{</span>
		<span class="c">// Attach the program to the interface</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">attachProgram</span><span class="p">(</span><span class="n">link</span><span class="p">,</span> <span class="n">coll</span><span class="o">.</span><span class="n">Programs</span><span class="p">[</span><span class="n">prog</span><span class="o">.</span><span class="n">progName</span><span class="p">],</span> <span class="n">prog</span><span class="o">.</span><span class="n">progName</span><span class="p">,</span> <span class="n">directionToParent</span><span class="p">(</span><span class="n">prog</span><span class="o">.</span><span class="n">direction</span><span class="p">),</span> <span class="n">xdpModeToFlag</span><span class="p">(</span><span class="n">xdpMode</span><span class="p">));</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">bpf</span><span class="o">.</span><span class="n">FinalizeBPFFSMigration</span><span class="p">(</span><span class="n">bpf</span><span class="o">.</span><span class="n">TCGlobalsPath</span><span class="p">(),</span> <span class="n">spec</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span>  <span class="c">// roll back to the original maps</span>
			<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">finalize</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
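<p>The reuse condition for pinned maps comes down to comparing a handful of properties. The following is a minimal sketch of that comparison. The <code class="language-plaintext highlighter-rouge">mapSpec</code> type and the <code class="language-plaintext highlighter-rouge">compatible</code> function are hypothetical stand-ins and not the types used by the eBPF library; they only model the properties named in the text.</p>

```go
package main

import "fmt"

// mapSpec holds the properties that decide whether a pinned map can be
// reused. It is a hypothetical stand-in for illustration only.
type mapSpec struct {
	Type       string
	KeySize    uint32
	ValueSize  uint32
	Flags      uint32
	MaxEntries uint32
}

// compatible reports whether a map already pinned on bpffs can be reused
// for the spec found in the freshly compiled ELF: all fields must be equal.
func compatible(pinned, wanted mapSpec) bool {
	return pinned == wanted
}

func main() {
	// Example values only.
	pinned := mapSpec{Type: "hash", KeySize: 8, ValueSize: 16, Flags: 0, MaxEntries: 1024}

	same := pinned
	grown := pinned
	grown.MaxEntries = 2048

	fmt.Println(compatible(pinned, same))  // true: reuse the pinned map
	fmt.Println(compatible(pinned, grown)) // false: migrate (re-pin), then load again
}
```

<p>When the comparison fails, the flow above corresponds to the <code class="language-plaintext highlighter-rouge">ErrMapIncompatible</code> branch: start the migration, retry the load, and finalize or roll back afterwards.</p>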
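<p>Attaching the BPF program is the job of the <code class="language-plaintext highlighter-rouge">attachProgram</code> function. When no <code class="language-plaintext highlighter-rouge">xdpFlags</code> are given, it <strong>attaches the BPF program to the network interface through tc by default</strong>, rather than to XDP. The interface's queueing discipline (qdisc) is set to the <code class="language-plaintext highlighter-rouge">clsact</code> type, and every BPF program is referenced by FD from a filter that is attached to the interface's qdisc. Note that every BPF program runs in <code class="language-plaintext highlighter-rouge">direct-action</code> mode, which lets the classifier and the action run as a single unit.</p>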
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">attachProgram</span><span class="p">(</span><span class="n">link</span> <span class="n">netlink</span><span class="o">.</span><span class="n">Link</span><span class="p">,</span> <span class="n">prog</span> <span class="o">*</span><span class="n">ebpf</span><span class="o">.</span><span class="n">Program</span><span class="p">,</span> <span class="n">progName</span> <span class="kt">string</span><span class="p">,</span> <span class="n">qdiscParent</span> <span class="kt">uint32</span><span class="p">,</span> <span class="n">xdpFlags</span> <span class="kt">uint32</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">prog</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">errors</span><span class="o">.</span><span class="n">New</span><span class="p">(</span><span class="s">"cannot attach a nil program"</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="k">if</span> <span class="n">xdpFlags</span> <span class="o">!=</span> <span class="m">0</span> <span class="p">{</span>
		<span class="c">// Attach the program to XDP</span>
		<span class="n">netlink</span><span class="o">.</span><span class="n">LinkSetXdpFdWithFlags</span><span class="p">(</span><span class="n">link</span><span class="p">,</span> <span class="n">prog</span><span class="o">.</span><span class="n">FD</span><span class="p">(),</span> <span class="kt">int</span><span class="p">(</span><span class="n">xdpFlags</span><span class="p">))</span>
		<span class="k">return</span> <span class="no">nil</span>
	<span class="p">}</span>

	<span class="n">err</span> <span class="o">:=</span> <span class="n">replaceQdisc</span><span class="p">(</span><span class="n">link</span><span class="p">)</span>  <span class="c">// replace the interface's existing clsact qdisc</span>

	<span class="n">filter</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">netlink</span><span class="o">.</span><span class="n">BpfFilter</span><span class="p">{</span>
		<span class="n">FilterAttrs</span><span class="o">:</span> <span class="n">netlink</span><span class="o">.</span><span class="n">FilterAttrs</span><span class="p">{</span>
			<span class="n">LinkIndex</span><span class="o">:</span> <span class="n">link</span><span class="o">.</span><span class="n">Attrs</span><span class="p">()</span><span class="o">.</span><span class="n">Index</span><span class="p">,</span>
			<span class="n">Parent</span><span class="o">:</span>    <span class="n">qdiscParent</span><span class="p">,</span>
			<span class="n">Handle</span><span class="o">:</span>    <span class="m">1</span><span class="p">,</span>
			<span class="n">Protocol</span><span class="o">:</span>  <span class="n">unix</span><span class="o">.</span><span class="n">ETH_P_ALL</span><span class="p">,</span>
			<span class="n">Priority</span><span class="o">:</span>  <span class="n">option</span><span class="o">.</span><span class="n">Config</span><span class="o">.</span><span class="n">TCFilterPriority</span><span class="p">,</span>
		<span class="p">},</span>
		<span class="n">Fd</span><span class="o">:</span>           <span class="n">prog</span><span class="o">.</span><span class="n">FD</span><span class="p">(),</span>
		<span class="n">Name</span><span class="o">:</span>         <span class="n">fmt</span><span class="o">.</span><span class="n">Sprintf</span><span class="p">(</span><span class="s">"%s-%s"</span><span class="p">,</span> <span class="n">progName</span><span class="p">,</span> <span class="n">link</span><span class="o">.</span><span class="n">Attrs</span><span class="p">()</span><span class="o">.</span><span class="n">Name</span><span class="p">),</span>
		<span class="n">DirectAction</span><span class="o">:</span> <span class="no">true</span><span class="p">,</span>  <span class="c">// enable direct-action mode</span>
	<span class="p">}</span>

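	<span class="c">// Replace the existing tc filter</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">netlink</span><span class="o">.</span><span class="n">FilterReplace</span><span class="p">(</span><span class="n">filter</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"replacing tc filter: %w"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="no">nil</span>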
<span class="p">}</span>
</code></pre></div></div>
<p>The result of the attachment can be observed with the tc command:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ tc qdisc show dev lxc0a9a490923c0
qdisc noqueue 0: root refcnt 2
qdisc clsact ffff: parent ffff:fff1

~ tc filter show dev lxc0a9a490923c0 ingress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_from_container-lxc0a9a490923c0 direct-action not_in_hw <span class="nb">id </span>2562 tag 8b558784f2a7a755 jited
</code></pre></div></div>
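<p>To make the identifiers in the tc output easier to read, here is a minimal Go sketch of how a 32-bit tc handle maps to the <code class="language-plaintext highlighter-rouge">major:minor</code> form that tc prints, and how the filter name is assembled. The constant values are taken from the kernel's <code class="language-plaintext highlighter-rouge">linux/pkt_sched.h</code> as I understand it; the helper names <code class="language-plaintext highlighter-rouge">formatHandle</code> and <code class="language-plaintext highlighter-rouge">filterName</code> are made up for illustration and are not part of Cilium or the netlink library.</p>

```go
package main

import "fmt"

// Handle values as defined in linux/pkt_sched.h (stated here as an assumption):
// the clsact qdisc lives at ffff:fff1, and its ingress/egress hooks use the
// minor numbers fff2 and fff3 under the same major.
const (
	handleClsact     uint32 = 0xFFFFFFF1
	handleMinIngress uint32 = 0xFFFFFFF2
	handleMinEgress  uint32 = 0xFFFFFFF3
)

// formatHandle renders a tc handle as "major:minor" in hex (hypothetical helper).
func formatHandle(h uint32) string {
	return fmt.Sprintf("%x:%x", h>>16, h&0xFFFF)
}

// filterName mirrors the Name field built in attachProgram above.
func filterName(progName, ifName string) string {
	return fmt.Sprintf("%s-%s", progName, ifName)
}

func main() {
	fmt.Println(formatHandle(handleClsact))
	fmt.Println(formatHandle(handleMinIngress))
	fmt.Println(formatHandle(handleMinEgress))
	fmt.Println(filterName("cil_from_container", "lxc0a9a490923c0"))
}
```

<p>The first printed handle matches the <code class="language-plaintext highlighter-rouge">parent ffff:fff1</code> shown for the clsact qdisc, and the generated name matches the filter name in the <code class="language-plaintext highlighter-rouge">tc filter show</code> output.</p>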
<h3 id="bpf-cil-from-container">bpf: cil-from-container</h3>
<p><code class="language-plaintext highlighter-rouge">cil-from-container</code> is the BPF program that Cilium loads on the ingress direction of an endpoint interface. Its call stack looks as follows (taking IPv4 as an example):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|- cil_from_container                                       @ bpf/bpf_lxc.c
   |- ep_tail_call(ctx, CILIUM_CALL_IPV4_FROM_LXC)          @ bpf/lib/maps.h
           ||         \                    /
      tail_call_static(ctx, &amp;CALLS_MAP, index)              @ bpf/include/bpf/tailcall.h
                                 |
                       struct bpf_elf_map __section_maps CALLS_MAP = { // per-endpoint private map used for internal tail calls
                         .type       = BPF_MAP_TYPE_PROG_ARRAY,  // special map type that stores a mapping from a custom index to a bpf_program_fd
                         .id         = CILIUM_MAP_CALLS,
                         .size_key   = sizeof(__u32),
                         .size_value = sizeof(__u32),
                         .pinning    = PIN_GLOBAL_NS,
                         .max_elem   = CILIUM_CALL_SIZE,
                       };
</code></pre></div></div>
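<p>Finally, the program performs a <a href="https://docs.cilium.io/en/stable/bpf/architecture/#tail-calls">tail call</a>: inline assembly loads the arguments into registers and then <mark>calls BPF helper number 12, which is <code class="language-plaintext highlighter-rouge">bpf_tail_call</code></mark>.</p>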
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// bpf/include/bpf/tailcall.h</span>

<span class="k">static</span> <span class="n">__always_inline</span> <span class="n">__maybe_unused</span> <span class="kt">void</span>
<span class="nf">tail_call_static</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">__ctx_buff</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">map</span><span class="p">,</span> <span class="k">const</span> <span class="n">__u32</span> <span class="n">slot</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">__builtin_constant_p</span><span class="p">(</span><span class="n">slot</span><span class="p">))</span>  <span class="c1">// slot must be a compile-time constant, otherwise the build fails</span>
		<span class="n">__throw_build_bug</span><span class="p">();</span>

	<span class="n">asm</span> <span class="k">volatile</span><span class="p">(</span><span class="s">"r1 = %[ctx]</span><span class="se">\n\t</span><span class="s">"</span>      <span class="c1">// load ctx into register r1</span>
		     <span class="s">"r2 = %[map]</span><span class="se">\n\t</span><span class="s">"</span>      <span class="c1">// load map into register r2</span>
		     <span class="s">"r3 = %[slot]</span><span class="se">\n\t</span><span class="s">"</span>     <span class="c1">// load slot into register r3</span>
		     <span class="s">"call 12</span><span class="se">\n\t</span><span class="s">"</span>          <span class="c1">// call BPF helper 12 (bpf_tail_call)</span>
		     <span class="o">::</span> <span class="p">[</span><span class="n">ctx</span><span class="p">]</span><span class="s">"r"</span><span class="p">(</span><span class="n">ctx</span><span class="p">),</span> <span class="p">[</span><span class="n">map</span><span class="p">]</span><span class="s">"r"</span><span class="p">(</span><span class="n">map</span><span class="p">),</span> <span class="p">[</span><span class="n">slot</span><span class="p">]</span><span class="s">"i"</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span>  <span class="c1">// input operands (the output operand list is empty)</span>
		     <span class="o">:</span> <span class="s">"r0"</span><span class="p">,</span> <span class="s">"r1"</span><span class="p">,</span> <span class="s">"r2"</span><span class="p">,</span> <span class="s">"r3"</span><span class="p">,</span> <span class="s">"r4"</span><span class="p">,</span> <span class="s">"r5"</span><span class="p">);</span>            <span class="c1">// clobbered registers</span>
<span class="p">}</span>
</code></pre></div></div>
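<p>The control flow of a tail call through a program array can be hard to picture from the assembly alone. The following is a toy model in Go, not BPF code: a map from slot index to function stands in for <code class="language-plaintext highlighter-rouge">BPF_MAP_TYPE_PROG_ARRAY</code>, and a successful "tail call" hands the verdict over to the target program entirely. All names, the slot number, and the verdict values are hypothetical and chosen only for illustration.</p>

```go
package main

import "fmt"

// packet stands in for the BPF context (ctx); trace records which programs ran.
type packet struct{ trace []string }

// program stands in for a BPF program and returns a verdict code.
type program func(ctx *packet) int

// progArray models BPF_MAP_TYPE_PROG_ARRAY: slot index -> program.
type progArray map[uint32]program

const (
	slotIPv4FromLXC       uint32 = 7 // hypothetical slot number
	verdictOK                    = 0 // hypothetical verdict values
	verdictMissedTailCall        = -1
)

// tailCall models the helper: if the slot is populated, the target program
// runs and its verdict is final; if the slot is empty, the caller simply
// continues with its next instruction (jumped == false).
func tailCall(ctx *packet, calls progArray, slot uint32) (verdict int, jumped bool) {
	target, found := calls[slot]
	if !found {
		return 0, false
	}
	return target(ctx), true
}

// entry plays the role of the program attached to the interface.
func entry(ctx *packet, calls progArray) int {
	ctx.trace = append(ctx.trace, "cil_from_container")
	if verdict, jumped := tailCall(ctx, calls, slotIPv4FromLXC); jumped {
		return verdict
	}
	// Only reached when the slot was empty.
	return verdictMissedTailCall
}

func main() {
	calls := progArray{
		slotIPv4FromLXC: func(ctx *packet) int {
			ctx.trace = append(ctx.trace, "tail_handle_ipv4")
			return verdictOK
		},
	}
	pkt := &packet{}
	fmt.Println(entry(pkt, calls), pkt.trace)

	miss := &packet{}
	fmt.Println(entry(miss, progArray{}), miss.trace)
}
```

<p>In the real datapath the jump happens inside the kernel and the caller's stack frame is reused; the model only captures the "jump if the slot is filled, fall through otherwise" behaviour.</p>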
<p>When <code class="language-plaintext highlighter-rouge">CILIUM_CALL_IPV4_FROM_LXC</code> is used as the index into <code class="language-plaintext highlighter-rouge">CALLS_MAP</code>, the corresponding tail-called function is the one shown below. It first validates and filters the packet, and then tail-calls into per-packet load balancing to a service, implemented by <code class="language-plaintext highlighter-rouge">__per_packet_lb_svc_xlate_4</code>. That function is outside the scope of this article and is omitted.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// bpf/bpf_lxc.c</span>

<span class="n">__section_tail</span><span class="p">(</span><span class="n">CILIUM_MAP_CALLS</span><span class="p">,</span> <span class="n">CILIUM_CALL_IPV4_FROM_LXC</span><span class="p">)</span>
<span class="kt">int</span> <span class="nf">tail_handle_ipv4</span><span class="p">(</span><span class="k">struct</span> <span class="n">__ctx_buff</span> <span class="o">*</span><span class="n">ctx</span><span class="p">)</span>
<span class="p">{</span>
	<span class="n">__s8</span> <span class="n">ext_err</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
	<span class="kt">int</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">__tail_handle_ipv4</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ext_err</span><span class="p">);</span>

	<span class="k">if</span> <span class="p">(</span><span class="n">IS_ERR</span><span class="p">(</span><span class="n">ret</span><span class="p">))</span>
		<span class="k">return</span> <span class="n">send_drop_notify_error_ext</span><span class="p">(</span><span class="cm">/*...*/</span><span class="p">);</span>
	<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="n">__always_inline</span> <span class="kt">int</span> <span class="nf">__tail_handle_ipv4</span><span class="p">(</span><span class="k">struct</span> <span class="n">__ctx_buff</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
					      <span class="n">__s8</span> <span class="o">*</span><span class="n">ext_err</span> <span class="n">__maybe_unused</span><span class="p">)</span>
<span class="p">{</span>
	<span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">,</span> <span class="o">*</span><span class="n">data_end</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">iphdr</span> <span class="o">*</span><span class="n">ip4</span><span class="p">;</span>

	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">revalidate_data_pull</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">data</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">data_end</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ip4</span><span class="p">))</span>  <span class="c1">// validate the length of the packet data</span>
		<span class="k">return</span> <span class="n">DROP_INVALID</span><span class="p">;</span>

<span class="cp">#ifndef ENABLE_IPV4_FRAGMENTS  // when IPv4 fragment support is disabled, drop any IPv4 fragment that arrives
</span>	<span class="k">if</span> <span class="p">(</span><span class="n">ipv4_is_fragment</span><span class="p">(</span><span class="n">ip4</span><span class="p">))</span>
		<span class="k">return</span> <span class="n">DROP_FRAG_NOSUPPORT</span><span class="p">;</span>
<span class="cp">#endif
</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="o">!</span><span class="n">is_valid_lxc_src_ipv4</span><span class="p">(</span><span class="n">ip4</span><span class="p">)))</span>  <span class="c1">// verify that the source IP address is valid</span>
		<span class="k">return</span> <span class="n">DROP_INVALID_SIP</span><span class="p">;</span>

<span class="cp">#ifdef ENABLE_PER_PACKET_LB
</span>	<span class="cm">/* performs a tail call internally or returns an error */</span>
	<span class="k">return</span> <span class="n">__per_packet_lb_svc_xlate_4</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ip4</span><span class="p">,</span> <span class="n">ext_err</span><span class="p">);</span>
<span class="cp">#else
</span>	<span class="cm">/* does not perform a tail call */</span>
	<span class="k">return</span> <span class="n">tail_ipv4_ct_egress</span><span class="p">(</span><span class="n">ctx</span><span class="p">);</span>
<span class="cp">#endif </span><span class="cm">/* ENABLE_PER_PACKET_LB */</span><span class="cp">
</span><span class="p">}</span>
</code></pre></div></div>
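<p>The order of the checks in <code class="language-plaintext highlighter-rouge">__tail_handle_ipv4</code> can be sketched in plain Go over a raw IPv4 header. This is only an illustration under several assumptions: the function <code class="language-plaintext highlighter-rouge">checkIPv4</code>, the helper <code class="language-plaintext highlighter-rouge">makeHeader</code> and the <code class="language-plaintext highlighter-rouge">PASS</code> verdict are invented here, the length and version test is a loose stand-in for <code class="language-plaintext highlighter-rouge">revalidate_data_pull</code>, and the fragment test (MF flag set or non-zero fragment offset) reflects the usual definition of an IPv4 fragment rather than a line-by-line copy of Cilium's helper.</p>

```go
package main

import (
	"encoding/binary"
	"fmt"
)

type verdict string

const (
	pass              verdict = "PASS" // hypothetical "continue processing" result
	dropInvalid       verdict = "DROP_INVALID"
	dropFragNoSupport verdict = "DROP_FRAG_NOSUPPORT"
	dropInvalidSIP    verdict = "DROP_INVALID_SIP"
)

const ipv4MinHeaderLen = 20

// checkIPv4 applies the three checks in the same order as the C code above.
func checkIPv4(hdr []byte, endpointIP [4]byte, fragmentsEnabled bool) verdict {
	// 1. Enough bytes for an IPv4 header, and the version nibble is 4.
	if len(hdr) < ipv4MinHeaderLen || hdr[0]>>4 != 4 {
		return dropInvalid
	}
	// 2. Bytes 6-7 hold flags plus fragment offset; masking with 0x3FFF keeps
	//    the MF flag and the 13-bit offset and ignores the DF flag.
	fragOff := binary.BigEndian.Uint16(hdr[6:8])
	if !fragmentsEnabled && fragOff&0x3FFF != 0 {
		return dropFragNoSupport
	}
	// 3. The source address (bytes 12-15) must be the endpoint's own address.
	var src [4]byte
	copy(src[:], hdr[12:16])
	if src != endpointIP {
		return dropInvalidSIP
	}
	return pass
}

// makeHeader builds a minimal 20-byte IPv4 header for the demo.
func makeHeader(src [4]byte, fragOff uint16) []byte {
	hdr := make([]byte, ipv4MinHeaderLen)
	hdr[0] = 0x45 // version 4, header length 5 words
	binary.BigEndian.PutUint16(hdr[6:8], fragOff)
	copy(hdr[12:16], src[:])
	return hdr
}

func main() {
	ep := [4]byte{10, 244, 2, 149}
	fmt.Println(checkIPv4(makeHeader(ep, 0), ep, false))
	fmt.Println(checkIPv4(makeHeader(ep, 0x2000), ep, false)) // MF flag set
	fmt.Println(checkIPv4(makeHeader([4]byte{10, 0, 0, 1}, 0), ep, false))
}
```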
<p>Another point worth noting is how <code class="language-plaintext highlighter-rouge">is_valid_lxc_src_ipv4</code> decides whether the source IP address is valid: it compares the packet's source address with the value of the <code class="language-plaintext highlighter-rouge">LXC_IPV4</code> macro. That macro is written to <code class="language-plaintext highlighter-rouge">/var/run/cilium/state/${endpoint-id}/ep_config.h</code> by the <a href="https://shawnh2.github.io/post/2023/07/18/cilium-cni-walk-through.html#regenerate">regenerate method</a> before tc ReloadDatapath runs.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ <span class="nb">cat</span> /var/run/cilium/state/1332/ep_config.h | <span class="nb">grep </span>IP
 <span class="k">*</span> IPv4 address: 10.244.2.149
DEFINE_U32<span class="o">(</span>LXC_IPV4, 0x9502f40a<span class="o">)</span><span class="p">;</span>	/<span class="k">*</span> 2499998730 <span class="k">*</span>/
<span class="c">#define LXC_IPV4 fetch_u32(LXC_IPV4)</span>
</code></pre></div></div>
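<p>The hexadecimal value in <code class="language-plaintext highlighter-rouge">ep_config.h</code> is the endpoint address stored in network byte order and read back as a 32-bit integer on a little-endian host. A small Go sketch makes the relation explicit; the function name <code class="language-plaintext highlighter-rouge">ipv4ToConfigValue</code> is made up for this example, and the little-endian host is an assumption that matches the values shown above.</p>

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// ipv4ToConfigValue returns the 32-bit value obtained when the four address
// bytes (network order) are read as a little-endian integer.
func ipv4ToConfigValue(s string) (uint32, error) {
	ip := net.ParseIP(s).To4()
	if ip == nil {
		return 0, fmt.Errorf("not an IPv4 address: %q", s)
	}
	return binary.LittleEndian.Uint32(ip), nil
}

func main() {
	v, err := ipv4ToConfigValue("10.244.2.149")
	if err != nil {
		panic(err)
	}
	fmt.Printf("0x%08x /* %d */\n", v, v) // prints: 0x9502f40a /* 2499998730 */
}
```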
<h3 id="endpoint-routes">Endpoint Routes</h3>
<p>When Cilium runs on native Kubernetes, the return values of <code class="language-plaintext highlighter-rouge">ep.RequireEgressProg()</code> and <code class="language-plaintext highlighter-rouge">ep.RequireEndpointRoute()</code> in the <code class="language-plaintext highlighter-rouge">reloadDatapath</code> method are both controlled by the <code class="language-plaintext highlighter-rouge">EnableEndpointRoutes</code> option of cilium-daemon (which is <strong>disabled by default</strong>). This means that, for non-host endpoints, <strong>BPF programs are normally reloaded only on the ingress direction of the endpoint interface</strong>.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/cmd/endpoint.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">d</span> <span class="o">*</span><span class="n">Daemon</span><span class="p">)</span> <span class="n">createEndpoint</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">owner</span> <span class="n">regeneration</span><span class="o">.</span><span class="n">Owner</span><span class="p">,</span> <span class="n">epTemplate</span> <span class="o">*</span><span class="n">models</span><span class="o">.</span><span class="n">EndpointChangeRequest</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">endpoint</span><span class="o">.</span><span class="n">Endpoint</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">option</span><span class="o">.</span><span class="n">Config</span><span class="o">.</span><span class="n">EnableEndpointRoutes</span> <span class="p">{</span>  <span class="c">// default: "false"</span>

		<span class="c">// install a route for each endpoint instead of using the route through cilium_host</span>
		<span class="n">epTemplate</span><span class="o">.</span><span class="n">DatapathConfiguration</span><span class="o">.</span><span class="n">InstallEndpointRoute</span> <span class="o">=</span> <span class="no">true</span>  <span class="c">// corresponds to RequireEndpointRoute()</span>

		<span class="c">// traffic is routed directly via the endpoint's interface and bypasses cilium_host, so a BPF program must also be attached on the egress direction of the endpoint interface</span>
		<span class="n">epTemplate</span><span class="o">.</span><span class="n">DatapathConfiguration</span><span class="o">.</span><span class="n">RequireEgressProg</span> <span class="o">=</span> <span class="no">true</span>  <span class="c">// corresponds to RequireEgressProg()</span>

		<span class="c">// ...</span>
	<span class="p">}</span>
	<span class="c">// ...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Because Cilium can integrate with public cloud platforms, <code class="language-plaintext highlighter-rouge">EnableEndpointRoutes</code> is enabled only when the network service provided by a public cloud is used. Taking GKE as an example, Cilium can run in Native-Routing mode on top of Google Cloud Network (GCN), and <a href="https://docs.cilium.io/en/stable/network/concepts/routing/#id6">one of the settings involved</a> is <code class="language-plaintext highlighter-rouge">enable-endpoint-routes: true</code>.</p>

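<p>In Native-Routing mode, Cilium delegates all packets that are <strong>not addressed to another local endpoint</strong> to the routing subsystem of the Linux kernel. Such packets are routed as if a local process had emitted them, which requires that the network connecting the cluster nodes be able to route <code class="language-plaintext highlighter-rouge">PodCIDRs</code> addresses. GCN happens to have this capability.</p>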

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-08-09/cilium-native-routes-gke.png" alt="native-routes-gke" /></p>

<p>Looking at the routing table in Native-Routing mode, each entry corresponds to a single endpoint. Compared with the <a href="https://shawnh2.github.io/post/2023/07/18/cilium-cni-walk-through.html#endpoint-%E8%B7%AF%E7%94%B1%E7%94%9F%E6%88%90">routing table in Cilium's default mode</a> (<code class="language-plaintext highlighter-rouge">enable-local-node-route: true</code>), these routes bypass the <code class="language-plaintext highlighter-rouge">cilium_host</code> device and go directly through the endpoint's interface. For this reason Cilium also reloads a BPF program on the egress direction of the endpoint interface in this case.</p>
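<p>The relation between the daemon option and the attach directions described in this section can be condensed into a few lines of Go. This is a simplified sketch, not Cilium code: the struct only mirrors the two fields set in <code class="language-plaintext highlighter-rouge">createEndpoint</code>, and the functions <code class="language-plaintext highlighter-rouge">configure</code> and <code class="language-plaintext highlighter-rouge">attachDirections</code> are hypothetical names.</p>

```go
package main

import "fmt"

// datapathConfiguration mirrors the two fields touched in createEndpoint.
type datapathConfiguration struct {
	InstallEndpointRoute bool
	RequireEgressProg    bool
}

// configure models the effect of the EnableEndpointRoutes daemon option.
func configure(enableEndpointRoutes bool) datapathConfiguration {
	var cfg datapathConfiguration
	if enableEndpointRoutes {
		cfg.InstallEndpointRoute = true
		cfg.RequireEgressProg = true
	}
	return cfg
}

// attachDirections lists the directions of a non-host endpoint interface
// that receive a BPF program: ingress always, egress only when required.
func attachDirections(cfg datapathConfiguration) []string {
	dirs := []string{"ingress"}
	if cfg.RequireEgressProg {
		dirs = append(dirs, "egress")
	}
	return dirs
}

func main() {
	for _, enabled := range []bool{false, true} {
		fmt.Println(enabled, attachDirections(configure(enabled)))
	}
}
```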

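<h2 id="总结">Summary</h2>
<p>This article analyzed BPF program reloading for two kinds of endpoints, host endpoints and regular endpoints, and gave a bird's-eye view of the code of the two BPF programs that get loaded. Although tc ReloadDatapath is only one step in how Cilium CNI works, there is a lot in it worth exploring. The article looks at the work done by tc from a narrow, local perspective and does not describe the overall Cilium process; the author's knowledge is limited and this is only a first look. Corrections for any mistakes or omissions are welcome.</p>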

<h2 id="reference">Reference</h2>

<ol>
  <li><a href="https://shawnh2.github.io/post/2023/07/18/cilium-cni-walk-through.html">https://shawnh2.github.io/post/2023/07/18/cilium-cni-walk-through.html</a></li>
  <li><a href="https://docs.cilium.io/en/stable/gettingstarted/terminology/#reserved-labels">https://docs.cilium.io/en/stable/gettingstarted/terminology/#reserved-labels</a></li>
  <li><a href="https://docs.cilium.io/en/stable/network/ebpf/intro/">https://docs.cilium.io/en/stable/network/ebpf/intro/</a></li>
  <li><a href="https://docs.cilium.io/en/latest/bpf/progtypes/#tc-traffic-control">https://docs.cilium.io/en/latest/bpf/progtypes/#tc-traffic-control</a></li>
  <li><a href="https://docs.cilium.io/en/stable/network/concepts/routing/">https://docs.cilium.io/en/stable/network/concepts/routing/</a></li>
  <li><a href="https://docs.cilium.io/en/stable/bpf/architecture/">https://docs.cilium.io/en/stable/bpf/architecture/</a></li>
  <li><a href="https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html">https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html</a></li>
  <li><a href="https://qmonnet.github.io/whirl-offload/2020/04/11/tc-bpf-direct-action/">https://qmonnet.github.io/whirl-offload/2020/04/11/tc-bpf-direct-action/</a></li>
  <li><a href="http://arthurchiao.art/blog/cilium-code-cni-create-network/#93-reload-datapath">http://arthurchiao.art/blog/cilium-code-cni-create-network/#93-reload-datapath</a></li>
  <li><a href="https://www.ebpf.top/post/bpf2pbpf_tail_call/">https://www.ebpf.top/post/bpf2pbpf_tail_call/</a></li>
</ol>]]></content><author><name>Your Name</name><email>shawnhxh@outlook.com</email></author><category term="post" /><category term="Network" /><category term="CNI" /><category term="Cilium" /><summary type="html"><![CDATA[The code in this article is based on Cilium HEAD 4093531. In Cilium CNI, every time a CiliumEndpoint is created, the Loader.CompileAndLoad method is triggered. An earlier article mentioned that Cilium uses tc (traffic control) to load compiled BPF programs into the kernel, but it did not describe the loading process or what gets loaded, so this article takes the opportunity to find out. // pkg/datapath/loader/loader.go func (l *Loader) CompileAndLoad(ctx context.Context, ep datapath.Endpoint, stats *metrics.SpanStat) error { if ep == nil { log.Fatalf("LoadBPF() doesn't support non-endpoint load") } dirs := directoryInfo{ Library: option.Config.BpfDir, // /var/lib/cilium/bpf, holds the BPF template files Runtime: option.Config.StateDir, // /var/run/cilium, holds endpoint runtime state State: ep.StateDir(), // /var/run/cilium/state/{endpoint-id} Output: ep.StateDir(), } return l.compileAndLoad(ctx, ep, &amp;dirs, stats) } func (l *Loader) compileAndLoad(ctx context.Context, ep datapath.Endpoint, dirs *directoryInfo) error { err := compileDatapath(ctx, dirs, ep.IsHost(), ep.Logger(Subsystem)) // compile the BPF program err = l.reloadDatapath(ctx, ep, dirs) // load the BPF program return err }]]></summary></entry><entry><title type="html">How Cilium CNI Works</title><link href="https://shawnh2.github.io/post/2023/07/18/cilium-cni-walk-through.html" rel="alternate" type="text/html" title="How Cilium CNI Works" /><published>2023-07-18T00:00:00+08:00</published><updated>2023-07-18T00:00:00+08:00</updated><id>https://shawnh2.github.io/post/2023/07/18/cilium-cni-walk-through</id><content type="html" xml:base="https://shawnh2.github.io/post/2023/07/18/cilium-cni-walk-through.html"><![CDATA[<blockquote>
  <p>The code in this article is based on Cilium HEAD <a href="https://github.com/cilium/cilium/commit/40935318e344424be1ea96510c96427aef5134c3">4093531</a> and focuses on the operations of the Cilium CNI.</p>
</blockquote>

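<h2 id="添加网络">Adding a Network</h2>
<p>Cilium CNI handles the ADD operation in <code class="language-plaintext highlighter-rouge">plugins/cilium-cni/main.go</code>, in the <code class="language-plaintext highlighter-rouge">cmdAdd</code> function, which <strong>is mainly responsible for creating the network for a Pod</strong>. Its overall control sequence is shown in the figure below. For the IP address allocation step the figure shows three IPAM modes (host-scope, crd, and eni); this article covers only host-scope, the default mode, which is the part of the flow marked with a red background.</p>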

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-07-16/cni-add-flow.png" alt="cni-add-flow" /></p>

<p>Because <code class="language-plaintext highlighter-rouge">cmdAdd</code> is long, the following sections analyze its important parts piece by piece.</p>

<!--more-->

<h3 id="cni-配置与参数加载">Loading the CNI Configuration and Arguments</h3>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// part 1</span>

<span class="k">func</span> <span class="n">cmdAdd</span><span class="p">(</span><span class="n">args</span> <span class="o">*</span><span class="n">skel</span><span class="o">.</span><span class="n">CmdArgs</span><span class="p">)</span> <span class="p">(</span><span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">var</span> <span class="p">(</span>
		<span class="n">ipConfig</span> <span class="o">*</span><span class="n">cniTypesV1</span><span class="o">.</span><span class="n">IPConfig</span>
		<span class="n">routes</span>   <span class="p">[]</span><span class="o">*</span><span class="n">cniTypes</span><span class="o">.</span><span class="n">Route</span>
		<span class="n">ipam</span>     <span class="o">*</span><span class="n">models</span><span class="o">.</span><span class="n">IPAMResponse</span>
		<span class="n">n</span>        <span class="o">*</span><span class="n">types</span><span class="o">.</span><span class="n">NetConf</span>
		<span class="n">c</span>        <span class="o">*</span><span class="n">client</span><span class="o">.</span><span class="n">Client</span>
		<span class="n">netNs</span>    <span class="n">ns</span><span class="o">.</span><span class="n">NetNS</span>
		<span class="n">conf</span>     <span class="o">*</span><span class="n">models</span><span class="o">.</span><span class="n">DaemonConfigurationStatus</span>
	<span class="p">)</span>  <span class="c">// variables used throughout the function</span>

	<span class="n">n</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">types</span><span class="o">.</span><span class="n">LoadNetConf</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">StdinData</span><span class="p">)</span>  <span class="c">// read the CNI network configuration: /etc/cni/net.d/05-cilium-cni.conf</span>

	<span class="n">cniArgs</span> <span class="o">:=</span> <span class="n">types</span><span class="o">.</span><span class="n">ArgsSpec</span><span class="p">{}</span>
	<span class="n">cniTypes</span><span class="o">.</span><span class="n">LoadArgs</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">Args</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">cniArgs</span><span class="p">)</span>  <span class="c">// load the CNI arguments</span>

	<span class="n">c</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">NewDefaultClientWithTimeout</span><span class="p">(</span><span class="n">defaults</span><span class="o">.</span><span class="n">ClientConnectTimeout</span><span class="p">)</span>  <span class="c">// initialize a client to connect to cilium-daemon</span>

	<span class="c">// ...</span>
</code></pre></div></div>
<p>The default content of the Cilium CNI network configuration file <code class="language-plaintext highlighter-rouge">05-cilium-cni.conf</code> is as follows:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="nl">"cniVersion"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0.3.1"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"cilium"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"cilium-cni"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"enable-debug"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
    </span><span class="nl">"log-file"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/var/run/cilium/cilium-cni.log"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
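<p>As an illustration of what reading such a configuration involves, the sketch below decodes the same JSON with Go's standard library. The <code class="language-plaintext highlighter-rouge">netConf</code> struct and <code class="language-plaintext highlighter-rouge">parseNetConf</code> function are simplified stand-ins written for this example; they are not Cilium's <code class="language-plaintext highlighter-rouge">types.NetConf</code> or <code class="language-plaintext highlighter-rouge">types.LoadNetConf</code>, which may carry more fields and validation.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// netConf is a hypothetical, reduced view of the configuration file above.
type netConf struct {
	CNIVersion  string `json:"cniVersion"`
	Name        string `json:"name"`
	Type        string `json:"type"`
	EnableDebug bool   `json:"enable-debug"`
	LogFile     string `json:"log-file"`
}

// parseNetConf decodes the bytes a CNI runtime would pass on stdin.
func parseNetConf(data []byte) (*netConf, error) {
	var n netConf
	if err := json.Unmarshal(data, &n); err != nil {
		return nil, fmt.Errorf("parsing CNI configuration: %w", err)
	}
	return &n, nil
}

const sampleConf = `{
    "cniVersion": "0.3.1",
    "name": "cilium",
    "type": "cilium-cni",
    "enable-debug": true,
    "log-file": "/var/run/cilium/cilium-cni.log"
}`

func main() {
	n, err := parseNetConf([]byte(sampleConf))
	if err != nil {
		panic(err)
	}
	fmt.Println(n.Name, n.Type, n.CNIVersion, n.EnableDebug, n.LogFile)
}
```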
<p>In addition, the initialized Client communicates with cilium-daemon over a UDS (UNIX domain socket) by default:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/client/client.go</span>

<span class="k">func</span> <span class="n">NewDefaultClient</span><span class="p">()</span> <span class="p">(</span><span class="o">*</span><span class="n">Client</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">NewClient</span><span class="p">(</span><span class="s">""</span><span class="p">)</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">NewClient</span><span class="p">(</span><span class="n">host</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">Client</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">clientTrans</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">NewRuntime</span><span class="p">(</span><span class="n">host</span><span class="p">)</span>
	<span class="k">return</span> <span class="o">&amp;</span><span class="n">Client</span><span class="p">{</span><span class="o">*</span><span class="n">clientapi</span><span class="o">.</span><span class="n">New</span><span class="p">(</span><span class="n">clientTrans</span><span class="p">,</span> <span class="n">strfmt</span><span class="o">.</span><span class="n">Default</span><span class="p">)},</span> <span class="n">err</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">NewRuntime</span><span class="p">(</span><span class="n">host</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">runtime_client</span><span class="o">.</span><span class="n">Runtime</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">host</span> <span class="o">==</span> <span class="s">""</span> <span class="p">{</span>
		<span class="n">host</span> <span class="o">=</span> <span class="n">DefaultSockPath</span><span class="p">()</span>
	<span class="p">}</span>
	<span class="n">tmp</span> <span class="o">:=</span> <span class="n">strings</span><span class="o">.</span><span class="n">SplitN</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="s">"://"</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>  <span class="c">// build the address according to the scheme; only tcp and unix sockets are currently supported</span>
	<span class="k">switch</span> <span class="n">tmp</span><span class="p">[</span><span class="m">0</span><span class="p">]</span> <span class="p">{</span>
	<span class="k">case</span> <span class="s">"tcp"</span><span class="o">:</span>
		<span class="n">host</span> <span class="o">=</span> <span class="s">"http://"</span> <span class="o">+</span> <span class="n">tmp</span><span class="p">[</span><span class="m">1</span><span class="p">]</span>
	<span class="k">case</span> <span class="s">"unix"</span><span class="o">:</span>
		<span class="n">host</span> <span class="o">=</span> <span class="n">tmp</span><span class="p">[</span><span class="m">1</span><span class="p">]</span>
	<span class="p">}</span>

	<span class="n">transport</span> <span class="o">:=</span> <span class="n">configureTransport</span><span class="p">(</span><span class="no">nil</span><span class="p">,</span> <span class="n">tmp</span><span class="p">[</span><span class="m">0</span><span class="p">],</span> <span class="n">host</span><span class="p">)</span>
	<span class="n">httpClient</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">http</span><span class="o">.</span><span class="n">Client</span><span class="p">{</span><span class="n">Transport</span><span class="o">:</span> <span class="n">transport</span><span class="p">}</span>
	<span class="n">clientTrans</span> <span class="o">:=</span> <span class="n">runtime_client</span><span class="o">.</span><span class="n">NewWithClient</span><span class="p">(</span><span class="n">tmp</span><span class="p">[</span><span class="m">1</span><span class="p">],</span> <span class="n">clientapi</span><span class="o">.</span><span class="n">DefaultBasePath</span><span class="p">,</span> <span class="n">clientapi</span><span class="o">.</span><span class="n">DefaultSchemes</span><span class="p">,</span> <span class="n">httpClient</span><span class="p">)</span>
	<span class="k">return</span> <span class="n">clientTrans</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">DefaultSockPath</span><span class="p">()</span> <span class="kt">string</span> <span class="p">{</span>
	<span class="n">e</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">Getenv</span><span class="p">(</span><span class="n">defaults</span><span class="o">.</span><span class="n">SockPathEnv</span><span class="p">)</span>  <span class="c">// read the socket path from the CILIUM_SOCK environment variable</span>
	<span class="k">if</span> <span class="n">e</span> <span class="o">==</span> <span class="s">""</span> <span class="p">{</span>
		<span class="n">e</span> <span class="o">=</span> <span class="n">defaults</span><span class="o">.</span><span class="n">SockPath</span>  <span class="c">// defaults to /var/run/cilium/cilium.sock</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="s">"unix://"</span> <span class="o">+</span> <span class="n">e</span>
<span class="p">}</span>
</code></pre></div></div>
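<p>How an HTTP client ends up talking over a UNIX domain socket may not be obvious from the excerpt, since <code class="language-plaintext highlighter-rouge">configureTransport</code> is not shown. The following is a minimal sketch using only Go's standard library; <code class="language-plaintext highlighter-rouge">splitHost</code> and <code class="language-plaintext highlighter-rouge">newUnixClient</code> are hypothetical helpers that illustrate the general technique (a custom <code class="language-plaintext highlighter-rouge">DialContext</code>), and are not claimed to match what Cilium does internally.</p>

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"strings"
)

// splitHost separates "scheme://address" in the same way NewRuntime does with
// strings.SplitN. An input without a scheme is returned unchanged as address.
func splitHost(host string) (scheme, addr string) {
	parts := strings.SplitN(host, "://", 2)
	if len(parts) != 2 {
		return "", host
	}
	return parts[0], parts[1]
}

// newUnixClient returns an http.Client whose connections always go to the
// socket at sockPath, whatever host name appears in the request URL.
func newUnixClient(sockPath string) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", sockPath)
			},
		},
	}
}

func main() {
	scheme, addr := splitHost("unix:///var/run/cilium/cilium.sock")
	fmt.Println(scheme, addr)

	// The client is only constructed here; no request is sent, because the
	// socket exists only on a node that runs the daemon.
	client := newUnixClient(addr)
	fmt.Println(client.Transport != nil)
}
```

<p>With such a client, a request URL still needs a placeholder host (for example <code class="language-plaintext highlighter-rouge">http://localhost/...</code>), but the bytes travel over the socket file instead of TCP.</p>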
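<h3 id="网口去重与-daemon-状态">Interface Deduplication and Daemon Status</h3>
<p>Cilium CNI first checks the interface name passed in for creation. If an interface with that name already exists, it is "replaced": the existing interface is deleted and a new one is created later.</p>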
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// part 2</span>

	<span class="n">netNs</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">ns</span><span class="o">.</span><span class="n">GetNS</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">Netns</span><span class="p">)</span>  <span class="c">// get the network namespace</span>
	<span class="k">defer</span> <span class="n">netNs</span><span class="o">.</span><span class="n">Close</span><span class="p">()</span>

	<span class="n">err</span> <span class="o">=</span> <span class="n">netns</span><span class="o">.</span><span class="n">RemoveIfFromNetNSIfExists</span><span class="p">(</span><span class="n">netNs</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">IfName</span><span class="p">)</span>  <span class="c">// remove the interface if it already exists</span>
                           <span class="err">\</span>
                            <span class="err">\</span>
                             <span class="k">func</span> <span class="n">RemoveIfFromNetNSIfExists</span><span class="p">(</span><span class="n">netNS</span> <span class="n">ns</span><span class="o">.</span><span class="n">NetNS</span><span class="p">,</span> <span class="n">ifName</span> <span class="kt">string</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
                                <span class="k">return</span> <span class="n">netNS</span><span class="o">.</span><span class="n">Do</span><span class="p">(</span><span class="k">func</span><span class="p">(</span><span class="n">_</span> <span class="n">ns</span><span class="o">.</span><span class="n">NetNS</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
                                    <span class="n">l</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">netlink</span><span class="o">.</span><span class="n">LinkByName</span><span class="p">(</span><span class="n">ifName</span><span class="p">)</span>
                                    <span class="k">return</span> <span class="n">netlink</span><span class="o">.</span><span class="n">LinkDel</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
                                <span class="p">})</span>
                             <span class="p">}</span>

	<span class="n">addLabels</span> <span class="o">:=</span> <span class="n">models</span><span class="o">.</span><span class="n">Labels</span><span class="p">{}</span>

	<span class="n">conf</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">getConfigFromCiliumAgent</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>  <span class="c">// fetch the cilium-daemon configuration through cilium-agent</span>

	<span class="c">// ...</span>
</code></pre></div></div>
<p>cilium-agent sends the request to cilium-daemon over the Client's Unix domain socket (UDS) to fetch the configuration. The main call stack is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|- getConfigFromCiliumAgent
   |- client.ConfigGet
      |- client.Daemon.GetConfig
</code></pre></div></div>
<p>Finally, the <code class="language-plaintext highlighter-rouge">GetConfig</code> method fetches the configuration by sending a <code class="language-plaintext highlighter-rouge">GET</code> request to the <code class="language-plaintext highlighter-rouge">/config</code> path of cilium-daemon:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// api/v1/client/daemon/daemon_client.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">Client</span><span class="p">)</span> <span class="n">GetConfig</span><span class="p">(</span><span class="n">params</span> <span class="o">*</span><span class="n">GetConfigParams</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">ClientOption</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">GetConfigOK</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">params</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="n">params</span> <span class="o">=</span> <span class="n">NewGetConfigParams</span><span class="p">()</span>
	<span class="p">}</span>
	<span class="n">op</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">runtime</span><span class="o">.</span><span class="n">ClientOperation</span><span class="p">{</span>
		<span class="n">ID</span><span class="o">:</span>                 <span class="s">"GetConfig"</span><span class="p">,</span>
		<span class="n">Method</span><span class="o">:</span>             <span class="s">"GET"</span><span class="p">,</span>
		<span class="n">PathPattern</span><span class="o">:</span>        <span class="s">"/config"</span><span class="p">,</span>  <span class="c">// ***</span>
		<span class="n">ProducesMediaTypes</span><span class="o">:</span> <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">"application/json"</span><span class="p">},</span>
		<span class="n">ConsumesMediaTypes</span><span class="o">:</span> <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">"application/json"</span><span class="p">},</span>
		<span class="n">Schemes</span><span class="o">:</span>            <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">"http"</span><span class="p">},</span>
		<span class="n">Params</span><span class="o">:</span>             <span class="n">params</span><span class="p">,</span>
		<span class="n">Reader</span><span class="o">:</span>             <span class="o">&amp;</span><span class="n">GetConfigReader</span><span class="p">{</span><span class="n">formats</span><span class="o">:</span> <span class="n">a</span><span class="o">.</span><span class="n">formats</span><span class="p">},</span>
		<span class="n">Context</span><span class="o">:</span>            <span class="n">params</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span>
		<span class="n">Client</span><span class="o">:</span>             <span class="n">params</span><span class="o">.</span><span class="n">HTTPClient</span><span class="p">,</span>
	<span class="p">}</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">opt</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">opts</span> <span class="p">{</span>  <span class="c">// opts is empty by default</span>
		<span class="n">opt</span><span class="p">(</span><span class="n">op</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="n">result</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">transport</span><span class="o">.</span><span class="n">Submit</span><span class="p">(</span><span class="n">op</span><span class="p">)</span>  <span class="c">// submit the request</span>
	<span class="n">success</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">result</span><span class="o">.</span><span class="p">(</span><span class="o">*</span><span class="n">GetConfigOK</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">ok</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">success</span><span class="p">,</span> <span class="no">nil</span>
	<span class="p">}</span>
	<span class="c">// reaching this point means the result is not the expected *GetConfigOK, so panic</span>
	<span class="nb">panic</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
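<p>To make the transport concrete, here is a minimal sketch of an HTTP <code class="language-plaintext highlighter-rouge">GET /config</code> request sent over a Unix domain socket using only the Go standard library. It is not Cilium's generated client: the socket path, the stand-in server, and the JSON payload are all assumptions made for illustration.</p>

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"path/filepath"
)

// newUnixClient returns an http.Client whose connections always dial the
// given Unix domain socket, regardless of the host in the request URL.
func newUnixClient(sockPath string) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", sockPath)
			},
		},
	}
}

// getConfigOverUDS starts a stand-in server on a temporary socket, sends
// GET /config through that socket, and returns the response body.
func getConfigOverUDS() (string, error) {
	dir, err := os.MkdirTemp("", "uds-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	sockPath := filepath.Join(dir, "demo.sock") // hypothetical path, not Cilium's

	ln, err := net.Listen("unix", sockPath)
	if err != nil {
		return "", err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/config", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, `{"status":{"ipam-mode":"kubernetes"}}`) // made-up payload
	})
	srv := &http.Server{Handler: mux}
	go srv.Serve(ln)
	defer srv.Close()

	// The host part of the URL is a placeholder; the dialer ignores it.
	resp, err := newUnixClient(sockPath).Get("http://localhost/config")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := getConfigOverUDS()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(body)
}
```

<p>The key design point is the custom <code class="language-plaintext highlighter-rouge">DialContext</code>: the request still looks like ordinary HTTP, but every connection is opened against the socket file instead of a TCP address.</p>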
<p>On the cilium-daemon side, the path-based API handlers are registered at startup, including the ones for <code class="language-plaintext highlighter-rouge">/config</code>:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/cmd/daemon_main.go</span>
<span class="c">// @ func (d *Daemon) instantiateAPI :: L1887-L1888</span>

<span class="c">// /config/</span>
<span class="n">restAPI</span><span class="o">.</span><span class="n">DaemonGetConfigHandler</span> <span class="o">=</span> <span class="n">NewGetConfigHandler</span><span class="p">(</span><span class="n">d</span><span class="p">)</span>  <span class="c">// handles the GET request</span>
<span class="n">restAPI</span><span class="o">.</span><span class="n">DaemonPatchConfigHandler</span> <span class="o">=</span> <span class="n">NewPatchConfigHandler</span><span class="p">(</span><span class="n">d</span><span class="p">)</span>
</code></pre></div></div>
<p>cilium-daemon's response for this endpoint consists of two parts, but <strong>the only part Cilium CNI cares about</strong> (that is, what <code class="language-plaintext highlighter-rouge">getConfigFromCiliumAgent</code> returns) <strong>is the</strong> <code class="language-plaintext highlighter-rouge">Status</code> <strong>section</strong>.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">DaemonConfiguration</span> <span class="k">struct</span> <span class="p">{</span>
	<span class="c">// the mutable configuration of the daemon</span>
	<span class="n">Spec</span> <span class="o">*</span><span class="n">DaemonConfigurationSpec</span> <span class="s">`json:"spec,omitempty"`</span>

	<span class="c">// the current status of the daemon configuration, including address information, mutable and immutable options, the node monitor, etc.</span>
	<span class="n">Status</span> <span class="o">*</span><span class="n">DaemonConfigurationStatus</span> <span class="s">`json:"status,omitempty"`</span>
<span class="p">}</span>
</code></pre></div></div>
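<p>As a rough illustration of "keep only the status", the sketch below decodes a two-part response and returns just its <code class="language-plaintext highlighter-rouge">status</code> section. The types are simplified stand-ins for the generated models, and the JSON keys of the status fields as well as the sample payload are assumptions.</p>

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Simplified stand-ins for the generated models. Only two status fields are
// shown, and their JSON keys are assumptions made for this sketch.
type daemonConfigurationStatus struct {
	IpamMode     string `json:"ipam-mode,omitempty"`
	DatapathMode string `json:"datapath-mode,omitempty"`
}

type daemonConfiguration struct {
	Spec   json.RawMessage            `json:"spec,omitempty"` // decoded lazily, ignored by the caller
	Status *daemonConfigurationStatus `json:"status,omitempty"`
}

// statusFromPayload decodes a /config response body and returns only Status.
func statusFromPayload(payload []byte) (*daemonConfigurationStatus, error) {
	var cfg daemonConfiguration
	if err := json.Unmarshal(payload, &cfg); err != nil {
		return nil, err
	}
	if cfg.Status == nil {
		return nil, errors.New("response has no status section")
	}
	return cfg.Status, nil
}

func main() {
	payload := []byte(`{"spec":{"options":{}},"status":{"ipam-mode":"kubernetes","datapath-mode":"veth"}}`)
	st, err := statusFromPayload(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(st.IpamMode, st.DatapathMode)
}
```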
<h3 id="ip-分配与-ipam-模式">IP Allocation and IPAM Modes</h3>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// part 3</span>

	<span class="k">var</span> <span class="n">releaseIPsFunc</span> <span class="k">func</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">conf</span><span class="o">.</span><span class="n">IpamMode</span> <span class="o">==</span> <span class="n">ipamOption</span><span class="o">.</span><span class="n">IPAMDelegatedPlugin</span> <span class="p">{</span>  <span class="c">// allocate addresses according to the IPAM mode</span>
		<span class="n">ipam</span><span class="p">,</span> <span class="n">releaseIPsFunc</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">allocateIPsWithDelegatedPlugin</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">TODO</span><span class="p">(),</span> <span class="n">conf</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">StdinData</span><span class="p">)</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="n">ipam</span><span class="p">,</span> <span class="n">releaseIPsFunc</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">allocateIPsWithCiliumAgent</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">cniArgs</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="c">// if an error occurs (during allocation or in any later step), release the allocated addresses</span>
	<span class="k">defer</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="o">&amp;&amp;</span> <span class="n">releaseIPsFunc</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">releaseIPsFunc</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">TODO</span><span class="p">())</span>
		<span class="p">}</span>
	<span class="p">}()</span>

	<span class="c">// ipam.HostAddressing holds the Cilium Internal IP</span>
	<span class="n">connector</span><span class="o">.</span><span class="n">SufficientAddressing</span><span class="p">(</span><span class="n">ipam</span><span class="o">.</span><span class="n">HostAddressing</span><span class="p">)</span>  <span class="c">// check that enough addressing information is present: at least one of IPv4 or IPv6 is required</span>

	<span class="c">// ...</span>
</code></pre></div></div>
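<p>The deferred closure in the excerpt above only works as a rollback if it can observe the error the function finally returns. The following sketch shows that pattern in isolation, assuming a named return value <code class="language-plaintext highlighter-rouge">err</code>; the function name, the fake IP, and the <code class="language-plaintext highlighter-rouge">released</code> slice are all hypothetical.</p>

```go
package main

import (
	"errors"
	"fmt"
)

// setup mimics the shape of the flow above: allocate first, then run later
// steps; if any later step fails, the deferred closure rolls the allocation
// back. All names here are illustrative, none come from Cilium.
func setup(failLater bool, released *[]string) (err error) {
	var release func()

	ip := "10.244.1.23" // pretend allocation result
	release = func() { *released = append(*released, ip) }

	// Because err is a named return value, the closure sees the error the
	// function finally returns, not the error state at the time of the defer.
	defer func() {
		if err != nil && release != nil {
			release()
		}
	}()

	if failLater {
		err = errors.New("interface configuration failed")
		return err
	}
	return nil
}

func main() {
	var released []string
	fmt.Println(setup(false, &released), released) // success: nothing released
	fmt.Println(setup(true, &released), released)  // failure: the IP is released
}
```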
<p>Cilium CNI applies a different IP allocation strategy depending on the IPAM mode. <code class="language-plaintext highlighter-rouge">conf.IpamMode</code> is populated from <code class="language-plaintext highlighter-rouge">DaemonConfig.IPAM</code>; in the cluster used here, its value is:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ kubectl get configmap cilium-config <span class="nt">-n</span> kube-system <span class="nt">-o</span> yaml | <span class="nb">grep </span>ipam

<span class="c"># ipam: kubernetes</span>
</code></pre></div></div>
<p>The full set of IPAM modes that Cilium currently supports is defined by the following constants:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/ipam/option/option.go</span>

<span class="k">const</span> <span class="p">(</span>
	<span class="n">IPAMKubernetes</span> <span class="o">=</span> <span class="s">"kubernetes"</span>  <span class="c">// the mode in use in this walkthrough</span>

	<span class="n">IPAMCRD</span> <span class="o">=</span> <span class="s">"crd"</span>
	<span class="n">IPAMENI</span> <span class="o">=</span> <span class="s">"eni"</span>
	<span class="n">IPAMAzure</span> <span class="o">=</span> <span class="s">"azure"</span>
	<span class="n">IPAMClusterPool</span> <span class="o">=</span> <span class="s">"cluster-pool"</span>
	<span class="n">IPAMClusterPoolV2</span> <span class="o">=</span> <span class="s">"cluster-pool-v2beta"</span>
	<span class="n">IPAMAlibabaCloud</span> <span class="o">=</span> <span class="s">"alibabacloud"</span>

	<span class="n">IPAMDelegatedPlugin</span> <span class="o">=</span> <span class="s">"delegated-plugin"</span>  <span class="c">// delegate to another CNI plugin</span>
<span class="p">)</span>
</code></pre></div></div>
<h4 id="delegated-plugin">Delegated Plugin</h4>
<p>When IP addresses are allocated through CNI plugin delegation, allocation uses <strong>the ADD action of the delegated CNI plugin</strong>, and release uses <strong>the DEL action of the delegated CNI plugin</strong>. Because this function is only used in <code class="language-plaintext highlighter-rouge">IPAMDelegatedPlugin</code> mode, its last step translates the result of the delegated call into the <code class="language-plaintext highlighter-rouge">IPAMResponse</code> type, so that it matches the return value of <code class="language-plaintext highlighter-rouge">allocateIPsWithCiliumAgent</code>.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">allocateIPsWithDelegatedPlugin</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">conf</span> <span class="o">*</span><span class="n">models</span><span class="o">.</span><span class="n">DaemonConfigurationStatus</span><span class="p">,</span> <span class="n">netConf</span> <span class="o">*</span><span class="n">types</span><span class="o">.</span><span class="n">NetConf</span><span class="p">,</span> <span class="n">stdinData</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">,</span>
<span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">models</span><span class="o">.</span><span class="n">IPAMResponse</span><span class="p">,</span> <span class="k">func</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">),</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="c">// netConf.IPAM.Type is the name of the delegated plugin; stdinData is the input the delegated call needs</span>
	<span class="n">ipamRawResult</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">cniInvoke</span><span class="o">.</span><span class="n">DelegateAdd</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">netConf</span><span class="o">.</span><span class="n">IPAM</span><span class="o">.</span><span class="n">Type</span><span class="p">,</span> <span class="n">stdinData</span><span class="p">,</span> <span class="no">nil</span><span class="p">)</span>  <span class="c">// invoke CNI ADD on the delegated plugin</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="c">// allocation failed, so there is no IP to clean up and no releaseFunc is returned</span>
		<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="no">nil</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to invoke delegated plugin ADD for IPAM: %w"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="c">// prepare a closure that performs the CNI DEL action</span>
	<span class="n">releaseFunc</span> <span class="o">:=</span> <span class="k">func</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">cniInvoke</span><span class="o">.</span><span class="n">DelegateDel</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">netConf</span><span class="o">.</span><span class="n">IPAM</span><span class="o">.</span><span class="n">Type</span><span class="p">,</span> <span class="n">stdinData</span><span class="p">,</span> <span class="no">nil</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="n">ipamResult</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">cniTypesV1</span><span class="o">.</span><span class="n">NewResultFromResult</span><span class="p">(</span><span class="n">ipamRawResult</span><span class="p">)</span>  <span class="c">// the delegated call returns a raw result; convert it to the CNI spec v1.0 result type</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">releaseFunc</span><span class="p">,</span> <span class="n">err</span>  <span class="c">// error wrapping elided</span>
	<span class="p">}</span>

	<span class="c">// unify the format: align the delegated result with the result of allocating through cilium-agent</span>
	<span class="n">ipam</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">models</span><span class="o">.</span><span class="n">IPAMResponse</span><span class="p">{</span>
		<span class="n">HostAddressing</span><span class="o">:</span> <span class="n">conf</span><span class="o">.</span><span class="n">Addressing</span><span class="p">,</span>
		<span class="n">Address</span><span class="o">:</span>        <span class="o">&amp;</span><span class="n">models</span><span class="o">.</span><span class="n">AddressPair</span><span class="p">{},</span>
	<span class="p">}</span>
	<span class="c">// record each allocated IPv4 or IPv6 address</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">ipConfig</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">ipamResult</span><span class="o">.</span><span class="n">IPs</span> <span class="p">{</span>
		<span class="n">ipNet</span> <span class="o">:=</span> <span class="n">ipConfig</span><span class="o">.</span><span class="n">Address</span>
		<span class="k">if</span> <span class="n">ipv4</span> <span class="o">:=</span> <span class="n">ipNet</span><span class="o">.</span><span class="n">IP</span><span class="o">.</span><span class="n">To4</span><span class="p">();</span> <span class="n">ipv4</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">ipam</span><span class="o">.</span><span class="n">Address</span><span class="o">.</span><span class="n">IPV4</span> <span class="o">=</span> <span class="n">ipNet</span><span class="o">.</span><span class="n">String</span><span class="p">()</span>
			<span class="n">ipam</span><span class="o">.</span><span class="n">IPV4</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">models</span><span class="o">.</span><span class="n">IPAMAddressResponse</span><span class="p">{</span><span class="n">IP</span><span class="o">:</span> <span class="n">ipv4</span><span class="o">.</span><span class="n">String</span><span class="p">()}</span>
		<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
			<span class="n">ipam</span><span class="o">.</span><span class="n">Address</span><span class="o">.</span><span class="n">IPV6</span> <span class="o">=</span> <span class="n">ipNet</span><span class="o">.</span><span class="n">String</span><span class="p">()</span>
			<span class="n">ipam</span><span class="o">.</span><span class="n">IPV6</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">models</span><span class="o">.</span><span class="n">IPAMAddressResponse</span><span class="p">{</span><span class="n">IP</span><span class="o">:</span> <span class="n">ipNet</span><span class="o">.</span><span class="n">IP</span><span class="o">.</span><span class="n">String</span><span class="p">()}</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">ipam</span><span class="p">,</span> <span class="n">releaseFunc</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
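<p>The loop at the end of the function sorts each returned address into an IPv4 or IPv6 slot by testing <code class="language-plaintext highlighter-rouge">To4()</code>. A standalone sketch of that classification, using only the standard library, might look like the following; <code class="language-plaintext highlighter-rouge">addressPair</code> is a simplified stand-in for <code class="language-plaintext highlighter-rouge">models.AddressPair</code>, and the helper name and sample addresses are made up.</p>

```go
package main

import (
	"fmt"
	"net"
)

// addressPair is a simplified stand-in for models.AddressPair.
type addressPair struct {
	IPV4 string
	IPV6 string
}

// pairFromCIDRs sorts CIDR strings into the IPv4 or IPv6 slot using To4,
// the same test the loop above applies to each ipConfig.Address.
func pairFromCIDRs(cidrs []string) (addressPair, error) {
	var pair addressPair
	for _, c := range cidrs {
		ip, ipNet, err := net.ParseCIDR(c)
		if err != nil {
			return addressPair{}, err
		}
		ipNet.IP = ip // keep the host address rather than the network base
		if ip.To4() != nil {
			pair.IPV4 = ipNet.String()
		} else {
			pair.IPV6 = ipNet.String()
		}
	}
	return pair, nil
}

func main() {
	pair, err := pairFromCIDRs([]string{"10.244.1.23/24", "fd00::17/64"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(pair.IPV4, pair.IPV6)
}
```

<p>Note that with this test an IPv4-mapped IPv6 address would also land in the IPv4 slot, because <code class="language-plaintext highlighter-rouge">To4()</code> returns a non-nil value for it.</p>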
<h4 id="cilium-agent">Cilium Agent</h4>
<p>Every IPAM mode other than <code class="language-plaintext highlighter-rouge">IPAMDelegatedPlugin</code> allocates IP addresses with the function below, in which both allocation and release go through cilium-agent. In the same way that cilium-agent fetches the cilium-daemon configuration above, <code class="language-plaintext highlighter-rouge">IPAMAllocate</code> has cilium-agent send a POST request to the <code class="language-plaintext highlighter-rouge">/ipam</code> path of cilium-daemon, while <code class="language-plaintext highlighter-rouge">IPAMReleaseIP</code> sends a DELETE request to the <code class="language-plaintext highlighter-rouge">/ipam/{ip}</code> path.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">allocateIPsWithCiliumAgent</span><span class="p">(</span><span class="n">client</span> <span class="o">*</span><span class="n">client</span><span class="o">.</span><span class="n">Client</span><span class="p">,</span> <span class="n">cniArgs</span> <span class="n">types</span><span class="o">.</span><span class="n">ArgsSpec</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">models</span><span class="o">.</span><span class="n">IPAMResponse</span><span class="p">,</span> <span class="k">func</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">),</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">podName</span> <span class="o">:=</span> <span class="kt">string</span><span class="p">(</span><span class="n">cniArgs</span><span class="o">.</span><span class="n">K8S_POD_NAMESPACE</span><span class="p">)</span> <span class="o">+</span> <span class="s">"/"</span> <span class="o">+</span> <span class="kt">string</span><span class="p">(</span><span class="n">cniArgs</span><span class="o">.</span><span class="n">K8S_POD_NAME</span><span class="p">)</span>  <span class="c">// namespaced name</span>
	<span class="n">pool</span> <span class="o">:=</span> <span class="s">""</span>
	<span class="n">ipam</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">client</span><span class="o">.</span><span class="n">IPAMAllocate</span><span class="p">(</span><span class="s">""</span><span class="p">,</span> <span class="n">podName</span><span class="p">,</span> <span class="n">pool</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span>  <span class="c">// allocate addresses through the local cilium-agent</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
	<span class="p">}</span>
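	<span class="k">if</span> <span class="n">ipam</span><span class="o">.</span><span class="n">Address</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>  <span class="c">// the response has no address field</span>
		<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="no">nil</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"invalid IPAM response, missing addressing"</span><span class="p">)</span>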
	<span class="p">}</span>

	<span class="n">releaseFunc</span> <span class="o">:=</span> <span class="k">func</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">ipam</span><span class="o">.</span><span class="n">Address</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">releaseIP</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">ipam</span><span class="o">.</span><span class="n">Address</span><span class="o">.</span><span class="n">IPV4</span><span class="p">,</span> <span class="n">pool</span><span class="p">)</span>
			<span class="n">releaseIP</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">ipam</span><span class="o">.</span><span class="n">Address</span><span class="o">.</span><span class="n">IPV6</span><span class="p">,</span> <span class="n">pool</span><span class="p">)</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">ipam</span><span class="p">,</span> <span class="n">releaseFunc</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">releaseIP</span><span class="p">(</span><span class="n">client</span> <span class="o">*</span><span class="n">client</span><span class="o">.</span><span class="n">Client</span><span class="p">,</span> <span class="n">ip</span><span class="p">,</span> <span class="n">pool</span> <span class="kt">string</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">ip</span> <span class="o">!=</span> <span class="s">""</span> <span class="p">{</span>
		<span class="n">err</span> <span class="o">:=</span> <span class="n">client</span><span class="o">.</span><span class="n">IPAMReleaseIP</span><span class="p">(</span><span class="n">ip</span><span class="p">,</span> <span class="n">pool</span><span class="p">)</span>  <span class="c">// release the address through the local cilium-agent</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In cilium-daemon, the handlers for the IPAM API are registered as follows:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/cmd/daemon_main.go</span>
<span class="c">// @ func (d *Daemon) instantiateAPI :: L1955-1960</span>

<span class="k">if</span> <span class="n">option</span><span class="o">.</span><span class="n">Config</span><span class="o">.</span><span class="n">DatapathMode</span> <span class="o">!=</span> <span class="n">datapathOption</span><span class="o">.</span><span class="n">DatapathModeLBOnly</span> <span class="p">{</span>
    <span class="c">// /ipam/{ip}/</span>
    <span class="n">restAPI</span><span class="o">.</span><span class="n">IpamPostIpamHandler</span> <span class="o">=</span> <span class="n">NewPostIPAMHandler</span><span class="p">(</span><span class="n">d</span><span class="p">)</span>  <span class="c">// serves IPAMAllocate</span>
    <span class="n">restAPI</span><span class="o">.</span><span class="n">IpamPostIpamIPHandler</span> <span class="o">=</span> <span class="n">NewPostIPAMIPHandler</span><span class="p">(</span><span class="n">d</span><span class="p">)</span>
    <span class="n">restAPI</span><span class="o">.</span><span class="n">IpamDeleteIpamIPHandler</span> <span class="o">=</span> <span class="n">NewDeleteIPAMIPHandler</span><span class="p">(</span><span class="n">d</span><span class="p">)</span>  <span class="c">// serves IPAMReleaseIP</span>
<span class="p">}</span>
</code></pre></div></div>
<p>For the handler that allocates a new IP address, the call chain is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|- daemon.ipam.AllocateNextWithExpiration @ daemon/cmd/ipam.go#L49
   |- ipam.AllocateNext                   @ pkg/ipam/allocator.go#L222
      |- ipam.AllocateNextFamily
         |- ipam.allocateNextFamily
            |- allocator.AllocateNext     @ interface
               |- implemented by @ clusterPoolAllocator
                                 @ crdAllocator
                                 @ hostScopeAllocator
                                 @ noOpAllocator
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">AllocateNext</code> is defined by the <code class="language-plaintext highlighter-rouge">Allocator</code> interface (<code class="language-plaintext highlighter-rouge">pkg/ipam/types.go</code>); it allocates the next available IP address, or returns an error when none is available. Many structs implement this method, but <code class="language-plaintext highlighter-rouge">hostScopeAllocator</code> is the one used by default (it corresponds to the <code class="language-plaintext highlighter-rouge">IPAMKubernetes</code> mode). In host-scope IPAM mode, IP addresses are allocated from the <code class="language-plaintext highlighter-rouge">PodCIDR</code> or <code class="language-plaintext highlighter-rouge">PodCIDRs</code> range defined on each K8s Node, as shown in the figure below.</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-07-16/ipam-host-scope.png" alt="ipam-host-scope" /></p>

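<p>The handler that releases IP addresses follows a call chain similar to the allocation path above and also ends up in the <code class="language-plaintext highlighter-rouge">Allocator</code> interface, this time in its <code class="language-plaintext highlighter-rouge">Release</code> method, which is implemented by the same structs.</p>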
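<p>To illustrate the idea of host-scope allocation, here is a toy allocator that hands out addresses from a single per-node prefix and supports release. It is only a sketch: the interface is a reduced, hypothetical version of the real one, the first-free linear scan is an assumption rather than Cilium's actual strategy, and no addresses (such as the broadcast address) are reserved.</p>

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// allocator is a reduced, hypothetical version of the interface described
// above; the real one has more methods and different signatures.
type allocator interface {
	AllocateNext(owner string) (netip.Addr, error)
	Release(ip netip.Addr) error
}

// hostScope hands out addresses from a single per-node prefix (the PodCIDR).
type hostScope struct {
	prefix netip.Prefix
	inUse  map[netip.Addr]string // address -> owner
}

// Compile-time check that hostScope satisfies the sketch interface.
var _ allocator = (*hostScope)(nil)

func newHostScope(cidr string) (*hostScope, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	return &hostScope{prefix: p.Masked(), inUse: map[netip.Addr]string{}}, nil
}

// AllocateNext returns the first free address after the network address,
// or an error once the prefix is exhausted.
func (h *hostScope) AllocateNext(owner string) (netip.Addr, error) {
	for ip := h.prefix.Addr().Next(); h.prefix.Contains(ip); ip = ip.Next() {
		if _, taken := h.inUse[ip]; !taken {
			h.inUse[ip] = owner
			return ip, nil
		}
	}
	return netip.Addr{}, errors.New("no IP available in " + h.prefix.String())
}

// Release frees an address so a later AllocateNext can hand it out again.
func (h *hostScope) Release(ip netip.Addr) error {
	if _, ok := h.inUse[ip]; !ok {
		return fmt.Errorf("%s is not allocated", ip)
	}
	delete(h.inUse, ip)
	return nil
}

func main() {
	a, err := newHostScope("10.244.1.0/30")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	ip1, _ := a.AllocateNext("default/pod-a")
	ip2, _ := a.AllocateNext("default/pod-b")
	fmt.Println(ip1, ip2)

	_ = a.Release(ip1)
	ip3, _ := a.AllocateNext("default/pod-c") // reuses the released address
	fmt.Println(ip3)
}
```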
<h4 id="cilium-internal-ip">Cilium Internal IP</h4>
<p>Regardless of which allocation method is used, the result is stored in an <code class="language-plaintext highlighter-rouge">IPAMResponse</code> struct. The struct also has a field named <code class="language-plaintext highlighter-rouge">HostAddressing</code>, which is easily mistaken for the IP of the host the Pod runs on, but <strong>it actually holds the Cilium Internal IP</strong>. Analogous to the Node resource in K8s, Cilium defines a resource named <a href="https://doc.crds.dev/github.com/cilium/cilium/cilium.io/CiliumNode/v2@v1.14.0-snapshot.4">CiliumNode</a> that represents a Node managed by Cilium:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ kubectl get ciliumnodes.cilium.io
NAME                 CILIUMINTERNALIP   INTERNALIP   AGE
kind-control-plane   10.244.0.48        172.19.0.4   2d11h
kind-worker          10.244.2.212       172.19.0.3   2d11h
kind-worker2         10.244.1.196       172.19.0.5   2d11h
</code></pre></div></div>
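<p>As the CiliumNode spec shows, it gathers all the IP addresses and related information that Cilium CNI needs, which makes them easy for cilium-agent to retrieve. The Cilium Internal IP is an IP that Cilium assigns automatically to each CiliumNode, and it belongs to the same subnet as the <code class="language-plaintext highlighter-rouge">PodCIDRs</code> defined on the Node. In other words, <strong>the Cilium Internal IP exists to make communication between the Nodes of a cluster easier</strong>: the CiliumNodes together form an overlay network.</p>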

<p>At the end of the IP allocation step, the code also checks whether the Cilium Internal IP exists. If it does not, the CNI ADD action is aborted.</p>
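<p>The "at least one of IPv4 or IPv6" check can be sketched as follows. This is an approximation written for illustration, not the body of <code class="language-plaintext highlighter-rouge">connector.SufficientAddressing</code>; the types, field names, and error messages are assumptions.</p>

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the generated models; field names are assumptions.
type nodeAddressingElement struct{ IP string }

type nodeAddressing struct {
	IPV4 *nodeAddressingElement
	IPV6 *nodeAddressingElement
}

// sufficientAddressing returns an error unless at least one of the IPv4 or
// IPv6 host addresses is present and non-empty.
func sufficientAddressing(addr *nodeAddressing) error {
	if addr == nil {
		return errors.New("no host addressing information provided")
	}
	if addr.IPV4 != nil && addr.IPV4.IP != "" {
		return nil
	}
	if addr.IPV6 != nil && addr.IPV6.IP != "" {
		return nil
	}
	return errors.New("neither an IPv4 nor an IPv6 host address provided")
}

func main() {
	ok := &nodeAddressing{IPV4: &nodeAddressingElement{IP: "10.244.1.196"}}
	fmt.Println(sufficientAddressing(ok) == nil)
	fmt.Println(sufficientAddressing(&nodeAddressing{}) == nil)
}
```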
<h3 id="veth-网口设置">veth Interface Setup</h3>
<p>By default, cilium-daemon is started with its datapath mode set to <code class="language-plaintext highlighter-rouge">veth</code>, so a veth pair is normally created. At the time of writing, <strong>Cilium defines only two datapath modes:</strong> <code class="language-plaintext highlighter-rouge">veth</code> <strong>and</strong> <code class="language-plaintext highlighter-rouge">lb-only</code> (<code class="language-plaintext highlighter-rouge">pkg/datapath/option/option.go</code>).</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// part 4</span>

	<span class="k">switch</span> <span class="n">conf</span><span class="o">.</span><span class="n">DatapathMode</span> <span class="p">{</span>
	<span class="k">case</span> <span class="n">datapathOption</span><span class="o">.</span><span class="n">DatapathModeVeth</span><span class="o">:</span>  <span class="c">// veth mode</span>
		<span class="k">var</span> <span class="p">(</span>
			<span class="n">veth</span>      <span class="o">*</span><span class="n">netlink</span><span class="o">.</span><span class="n">Veth</span>
			<span class="n">peer</span>      <span class="n">netlink</span><span class="o">.</span><span class="n">Link</span>
			<span class="n">tmpIfName</span> <span class="kt">string</span>
		<span class="p">)</span>
		<span class="c">// First create the veth pair on the host side</span>
		<span class="n">veth</span><span class="p">,</span> <span class="n">peer</span><span class="p">,</span> <span class="n">tmpIfName</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">connector</span><span class="o">.</span><span class="n">SetupVeth</span><span class="p">(</span><span class="n">ep</span><span class="o">.</span><span class="n">ContainerID</span><span class="p">,</span> <span class="kt">int</span><span class="p">(</span><span class="n">conf</span><span class="o">.</span><span class="n">DeviceMTU</span><span class="p">),</span> <span class="kt">int</span><span class="p">(</span><span class="n">conf</span><span class="o">.</span><span class="n">GROMaxSize</span><span class="p">),</span> <span class="kt">int</span><span class="p">(</span><span class="n">conf</span><span class="o">.</span><span class="n">GSOMaxSize</span><span class="p">),</span> <span class="n">ep</span><span class="p">)</span>

		<span class="k">defer</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
			<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
				<span class="n">err2</span> <span class="o">:=</span> <span class="n">netlink</span><span class="o">.</span><span class="n">LinkDel</span><span class="p">(</span><span class="n">veth</span><span class="p">)</span>  <span class="c">// If setup fails, delete the veth pair</span>
			<span class="p">}</span>
		<span class="p">}()</span>

		<span class="n">err</span> <span class="o">=</span> <span class="n">netlink</span><span class="o">.</span><span class="n">LinkSetNsFd</span><span class="p">(</span><span class="n">peer</span><span class="p">,</span> <span class="kt">int</span><span class="p">(</span><span class="n">netNs</span><span class="o">.</span><span class="n">Fd</span><span class="p">()))</span>  <span class="c">// Move the peer end of the veth pair into the netns</span>

		<span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">connector</span><span class="o">.</span><span class="n">SetupVethRemoteNs</span><span class="p">(</span><span class="n">netNs</span><span class="p">,</span> <span class="n">tmpIfName</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">IfName</span><span class="p">)</span>  <span class="c">// Finally, set the veth interface name on the container side</span>
	<span class="p">}</span>

	<span class="c">// ...</span>
</code></pre></div></div>
<p>Note that both the local veth interface and the peer link are already created inside <code class="language-plaintext highlighter-rouge">connector.SetupVeth</code>, and the two follow these naming rules:</p>

<ul>
  <li>The local interface name is <code class="language-plaintext highlighter-rouge">lxc</code> + the first N characters of <code class="language-plaintext highlighter-rouge">sha256(containerID)</code></li>
  <li>The peer link name is <code class="language-plaintext highlighter-rouge">tmp</code> + the first N characters of <code class="language-plaintext highlighter-rouge">sha256(containerID)</code>; as the prefix suggests, this is only a temporary name</li>
</ul>

<p>Next, <code class="language-plaintext highlighter-rouge">LinkSetNsFd</code> moves the peer link into the target network namespace, and finally <code class="language-plaintext highlighter-rouge">connector.SetupVethRemoteNs</code> renames the peer link to the interface name given in the CNI arguments. All of the network-interface operations involved here go through the netlink library.</p>
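<p>The naming rule above can be sketched roughly as follows. The function name <code class="language-plaintext highlighter-rouge">ifName</code> and the 15-character limit are assumptions of this sketch (Linux caps interface names at 15 visible characters); the exact truncation length N used by Cilium is not taken from the source.</p>

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// maxIfNameLen is an assumption of this sketch: Linux interface names
// hold at most 15 visible characters (IFNAMSIZ is 16 with the NUL byte).
const maxIfNameLen = 15

// ifName builds prefix + hex(sha256(containerID)) and truncates the
// result to maxIfNameLen. Hypothetical helper, not Cilium's own function.
func ifName(prefix, containerID string) string {
	sum := fmt.Sprintf("%x", sha256.Sum256([]byte(containerID)))
	name := prefix + sum
	if len(name) > maxIfNameLen {
		name = name[:maxIfNameLen]
	}
	return name
}

func main() {
	id := "example-container-id" // hypothetical container ID
	fmt.Println(ifName("lxc", id)) // host-side interface name
	fmt.Println(ifName("tmp", id)) // temporary peer link name
}
```

<p>Because both names are derived from the same hash, the host-side interface and its temporary peer share the same suffix, which makes the pair easy to correlate while the peer still carries its temporary name.</p>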
<h3 id="endpoint-路由生成">Endpoint Route Generation</h3>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// part 5</span>

	<span class="n">ep</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">models</span><span class="o">.</span><span class="n">EndpointChangeRequest</span><span class="p">{</span>  <span class="c">// This struct holds all mutable fields of a Cilium Endpoint</span>
		<span class="n">ContainerID</span><span class="o">:</span>           <span class="n">args</span><span class="o">.</span><span class="n">ContainerID</span><span class="p">,</span>
		<span class="n">Addressing</span><span class="o">:</span>            <span class="o">&amp;</span><span class="n">models</span><span class="o">.</span><span class="n">AddressPair</span><span class="p">{},</span>
		<span class="n">K8sPodName</span><span class="o">:</span>            <span class="kt">string</span><span class="p">(</span><span class="n">cniArgs</span><span class="o">.</span><span class="n">K8S_POD_NAME</span><span class="p">),</span>
		<span class="n">K8sNamespace</span><span class="o">:</span>          <span class="kt">string</span><span class="p">(</span><span class="n">cniArgs</span><span class="o">.</span><span class="n">K8S_POD_NAMESPACE</span><span class="p">),</span>
		<span class="c">// ...</span>
	<span class="p">}</span>

	<span class="n">state</span> <span class="o">:=</span> <span class="n">CmdState</span><span class="p">{</span>
		<span class="n">Endpoint</span><span class="o">:</span> <span class="n">ep</span><span class="p">,</span>
		<span class="n">Client</span><span class="o">:</span>   <span class="n">c</span><span class="p">,</span>
		<span class="n">HostAddr</span><span class="o">:</span> <span class="n">ipam</span><span class="o">.</span><span class="n">HostAddressing</span><span class="p">,</span>  <span class="c">// Cilium Internal IP</span>
	<span class="p">}</span>

	<span class="n">res</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">cniTypesV1</span><span class="o">.</span><span class="n">Result</span><span class="p">{}</span>  <span class="c">// The value this function eventually returns</span>

	<span class="k">if</span> <span class="n">ipv4IsEnabled</span><span class="p">(</span><span class="n">ipam</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">ep</span><span class="o">.</span><span class="n">Addressing</span><span class="o">.</span><span class="n">IPV4</span> <span class="o">=</span> <span class="n">ipam</span><span class="o">.</span><span class="n">Address</span><span class="o">.</span><span class="n">IPV4</span>
		<span class="n">ep</span><span class="o">.</span><span class="n">Addressing</span><span class="o">.</span><span class="n">IPV4ExpirationUUID</span> <span class="o">=</span> <span class="n">ipam</span><span class="o">.</span><span class="n">IPV4</span><span class="o">.</span><span class="n">ExpirationUUID</span>

		<span class="n">ipConfig</span><span class="p">,</span> <span class="n">routes</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">prepareIP</span><span class="p">(</span><span class="n">ep</span><span class="o">.</span><span class="n">Addressing</span><span class="o">.</span><span class="n">IPV4</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">state</span><span class="p">,</span> <span class="kt">int</span><span class="p">(</span><span class="n">conf</span><span class="o">.</span><span class="n">RouteMTU</span><span class="p">))</span>  <span class="c">// Parse the IP; returns the IP and gateway address plus the matching routes</span>

		<span class="n">res</span><span class="o">.</span><span class="n">IPs</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">res</span><span class="o">.</span><span class="n">IPs</span><span class="p">,</span> <span class="n">ipConfig</span><span class="p">)</span>
		<span class="n">res</span><span class="o">.</span><span class="n">Routes</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">res</span><span class="o">.</span><span class="n">Routes</span><span class="p">,</span> <span class="n">routes</span><span class="o">...</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="c">// if ipv6IsEnabled(ipam) { omitted; same as above }</span>

	<span class="c">// ...</span>
</code></pre></div></div>
<p>This part builds the return value of <code class="language-plaintext highlighter-rouge">cmdAdd</code>. Its <code class="language-plaintext highlighter-rouge">IPs</code> and <code class="language-plaintext highlighter-rouge">Routes</code> fields are both produced by <code class="language-plaintext highlighter-rouge">prepareIP</code>, which parses the IP address (in CIDR format by default) regardless of whether it came from a Delegated Plugin or from cilium-agent:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">prepareIP</span><span class="p">(</span><span class="n">ipAddr</span> <span class="kt">string</span><span class="p">,</span> <span class="n">state</span> <span class="o">*</span><span class="n">CmdState</span><span class="p">,</span> <span class="n">mtu</span> <span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">cniTypesV1</span><span class="o">.</span><span class="n">IPConfig</span><span class="p">,</span> <span class="p">[]</span><span class="o">*</span><span class="n">cniTypes</span><span class="o">.</span><span class="n">Route</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">var</span> <span class="p">(</span>
		<span class="n">routes</span> <span class="p">[]</span><span class="n">route</span><span class="o">.</span><span class="n">Route</span>
		<span class="n">gw</span>     <span class="kt">string</span>
		<span class="n">ip</span>     <span class="n">netip</span><span class="o">.</span><span class="n">Addr</span>
	<span class="p">)</span>

	<span class="c">// Parse the IP address as CIDR notation</span>
	<span class="n">ipPrefix</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">netip</span><span class="o">.</span><span class="n">ParsePrefix</span><span class="p">(</span><span class="n">ipAddr</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="n">ip</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">netip</span><span class="o">.</span><span class="n">ParseAddr</span><span class="p">(</span><span class="n">ipAddr</span><span class="p">)</span>  <span class="c">// Plain IP address, not in CIDR notation</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="n">ip</span> <span class="o">=</span> <span class="n">ipPrefix</span><span class="o">.</span><span class="n">Addr</span><span class="p">()</span>
	<span class="p">}</span>

	<span class="k">if</span> <span class="n">ip</span><span class="o">.</span><span class="n">Is6</span><span class="p">()</span> <span class="p">{</span>
		<span class="c">// Same logic as below; omitted</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="n">state</span><span class="o">.</span><span class="n">IP4</span> <span class="o">=</span> <span class="n">ip</span>
		<span class="n">state</span><span class="o">.</span><span class="n">IP4routes</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">connector</span><span class="o">.</span><span class="n">IPv4Routes</span><span class="p">(</span><span class="n">state</span><span class="o">.</span><span class="n">HostAddr</span><span class="p">,</span> <span class="n">mtu</span><span class="p">)</span>  <span class="c">// Routes to be installed inside the Endpoint's network namespace</span>
		<span class="n">routes</span> <span class="o">=</span> <span class="n">state</span><span class="o">.</span><span class="n">IP4routes</span>
		<span class="n">ip</span> <span class="o">=</span> <span class="n">state</span><span class="o">.</span><span class="n">IP4</span>
		<span class="n">gw</span> <span class="o">=</span> <span class="n">connector</span><span class="o">.</span><span class="n">IPv4Gateway</span><span class="p">(</span><span class="n">state</span><span class="o">.</span><span class="n">HostAddr</span><span class="p">)</span>  <span class="c">// Gateway address of the Endpoint, i.e. the Cilium Internal IP =&gt; return addr.IPV4.IP</span>
	<span class="p">}</span>

	<span class="n">rt</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="o">*</span><span class="n">cniTypes</span><span class="o">.</span><span class="n">Route</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">routes</span><span class="p">))</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">r</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">routes</span> <span class="p">{</span>
		<span class="n">rt</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">rt</span><span class="p">,</span> <span class="n">newCNIRoute</span><span class="p">(</span><span class="n">r</span><span class="p">))</span>  <span class="c">// Convert to the Route type used by CNI</span>
	<span class="p">}</span>
	<span class="n">gwIP</span> <span class="o">:=</span> <span class="n">net</span><span class="o">.</span><span class="n">ParseIP</span><span class="p">(</span><span class="n">gw</span><span class="p">)</span>

	<span class="k">return</span> <span class="o">&amp;</span><span class="n">cniTypesV1</span><span class="o">.</span><span class="n">IPConfig</span><span class="p">{</span>
		<span class="n">Address</span><span class="o">:</span> <span class="o">*</span><span class="n">iputil</span><span class="o">.</span><span class="n">AddrToIPNet</span><span class="p">(</span><span class="n">ip</span><span class="p">),</span>
		<span class="n">Gateway</span><span class="o">:</span> <span class="n">gwIP</span><span class="p">,</span>
	<span class="p">},</span> <span class="n">rt</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
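<p>The parsing fallback at the top of <code class="language-plaintext highlighter-rouge">prepareIP</code> (try CIDR notation first, then a bare address) can be isolated into a small standalone sketch. The helper name <code class="language-plaintext highlighter-rouge">parseAddr</code> and the sample addresses are made up for illustration; only the standard library's <code class="language-plaintext highlighter-rouge">net/netip</code> is used.</p>

```go
package main

import (
	"fmt"
	"net/netip"
)

// parseAddr accepts either "addr/len" or a bare "addr" and returns
// the address part. Hypothetical helper mirroring the fallback above.
func parseAddr(s string) (netip.Addr, error) {
	if prefix, err := netip.ParsePrefix(s); err == nil {
		return prefix.Addr(), nil
	}
	// Not CIDR notation: fall back to parsing a plain IP address.
	return netip.ParseAddr(s)
}

func main() {
	// Sample inputs are assumptions chosen for illustration.
	for _, s := range []string{"10.244.1.5/32", "10.244.1.5", "fd00::5/128"} {
		addr, err := parseAddr(s)
		fmt.Println(addr, addr.Is6(), err)
	}
}
```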
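<p>As for the routes returned by <code class="language-plaintext highlighter-rouge">connector.IPv4Routes</code>, its argument <code class="language-plaintext highlighter-rouge">state.HostAddr</code> is essentially the Cilium Internal IP. Every Endpoint uses this Internal IP to create a default route in its own network namespace: <strong>traffic to any destination without a more specific route is forwarded to the Cilium Internal IP as the next hop, so that address acts as the Endpoint's default gateway, and the gateway address itself is installed as a prefix route</strong>.</p>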
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/datapath/connector/ipam.go</span>

<span class="k">func</span> <span class="n">IPv4Routes</span><span class="p">(</span><span class="n">addr</span> <span class="o">*</span><span class="n">models</span><span class="o">.</span><span class="n">NodeAddressing</span><span class="p">,</span> <span class="n">linkMTU</span> <span class="kt">int</span><span class="p">)</span> <span class="p">([]</span><span class="n">route</span><span class="o">.</span><span class="n">Route</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">ip</span> <span class="o">:=</span> <span class="n">net</span><span class="o">.</span><span class="n">ParseIP</span><span class="p">(</span><span class="n">addr</span><span class="o">.</span><span class="n">IPV4</span><span class="o">.</span><span class="n">IP</span><span class="p">)</span>

	<span class="k">return</span> <span class="p">[]</span><span class="n">route</span><span class="o">.</span><span class="n">Route</span><span class="p">{</span>
		<span class="p">{</span>
			<span class="n">Prefix</span><span class="o">:</span> <span class="n">net</span><span class="o">.</span><span class="n">IPNet</span><span class="p">{</span>
				<span class="n">IP</span><span class="o">:</span>   <span class="n">ip</span><span class="p">,</span>
				<span class="n">Mask</span><span class="o">:</span> <span class="n">defaults</span><span class="o">.</span><span class="n">ContainerIPv4Mask</span><span class="p">,</span>  <span class="c">// 255.255.255.255</span>
			<span class="p">},</span>
		<span class="p">},</span>
		<span class="p">{</span>
			<span class="n">Prefix</span><span class="o">:</span>  <span class="n">defaults</span><span class="o">.</span><span class="n">IPv4DefaultRoute</span><span class="p">,</span>  <span class="c">// 0.0.0.0/0</span>
			<span class="n">Nexthop</span><span class="o">:</span> <span class="o">&amp;</span><span class="n">ip</span><span class="p">,</span>
			<span class="n">MTU</span><span class="o">:</span>     <span class="n">linkMTU</span><span class="p">,</span>
		<span class="p">},</span>
	<span class="p">},</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
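<p>To make the two routes concrete, here is a minimal sketch that rebuilds them with only the standard library. The <code class="language-plaintext highlighter-rouge">route</code> type and <code class="language-plaintext highlighter-rouge">endpointRoutes</code> function are simplified stand-ins invented for this example, not Cilium's own types; the gateway address is the Internal IP of <code class="language-plaintext highlighter-rouge">kind-worker</code> from the earlier output.</p>

```go
package main

import (
	"fmt"
	"net"
)

// route is a simplified stand-in for Cilium's route.Route
// (hypothetical type for this sketch).
type route struct {
	Prefix  net.IPNet
	Nexthop net.IP // nil means an on-link route
}

// endpointRoutes returns a /32 route to the gateway plus a default
// route via that gateway, for the given IPv4 gateway address.
func endpointRoutes(gateway string) ([]route, error) {
	gw := net.ParseIP(gateway).To4()
	if gw == nil {
		return nil, fmt.Errorf("invalid IPv4 gateway %q", gateway)
	}
	return []route{
		// Host route: makes the gateway itself reachable on-link.
		{Prefix: net.IPNet{IP: gw, Mask: net.CIDRMask(32, 32)}},
		// Default route: everything else goes via the gateway.
		{Prefix: net.IPNet{IP: net.IPv4zero, Mask: net.CIDRMask(0, 32)}, Nexthop: gw},
	}, nil
}

func main() {
	routes, err := endpointRoutes("10.244.2.212")
	if err != nil {
		panic(err)
	}
	for _, r := range routes {
		fmt.Printf("%s via %v\n", r.Prefix.String(), r.Nexthop)
	}
}
```

<p>The host route has to exist before the default route is usable: without it, the kernel inside the Endpoint's namespace would have no way to reach the next hop.</p>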
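<h3 id="endpoint-创建">Endpoint Creation</h3>
<p>Although this step is not shown in the sequence diagram at the beginning of the article, it is the most important part of the CNI ADD operation. <a href="http://arthurchiao.art/blog/cilium-code-cni-create-network/#8-upsert-ip-information-to-kvstore">arthurchiao</a> has a nice diagram summarizing it, which is worth a look:</p>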

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-07-16/endpoint-creation.png" alt="endpoint-creation" /></p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// part 6</span>

	<span class="k">var</span> <span class="n">macAddrStr</span> <span class="kt">string</span>
	<span class="n">err</span> <span class="o">=</span> <span class="n">netNs</span><span class="o">.</span><span class="n">Do</span><span class="p">(</span><span class="k">func</span><span class="p">(</span><span class="n">_</span> <span class="n">ns</span><span class="o">.</span><span class="n">NetNS</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
		<span class="n">macAddrStr</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">configureIface</span><span class="p">(</span><span class="n">ipam</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">IfName</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">state</span><span class="p">)</span>  <span class="c">// Bring the interface up, write the IP and routes, and return its hardware MAC address</span>
		<span class="k">return</span> <span class="n">err</span>
	<span class="p">})</span>

	<span class="n">res</span><span class="o">.</span><span class="n">Interfaces</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">res</span><span class="o">.</span><span class="n">Interfaces</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">cniTypesV1</span><span class="o">.</span><span class="n">Interface</span><span class="p">{</span>  <span class="c">// Record the network interface</span>
		<span class="n">Name</span><span class="o">:</span>    <span class="n">args</span><span class="o">.</span><span class="n">IfName</span><span class="p">,</span>
		<span class="n">Mac</span><span class="o">:</span>     <span class="n">macAddrStr</span><span class="p">,</span>
		<span class="n">Sandbox</span><span class="o">:</span> <span class="n">args</span><span class="o">.</span><span class="n">Netns</span><span class="p">,</span>
	<span class="p">})</span>

	<span class="c">// Also record each interface's index in the result</span>
	<span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">res</span><span class="o">.</span><span class="n">Interfaces</span> <span class="p">{</span>
		<span class="n">res</span><span class="o">.</span><span class="n">IPs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">Interface</span> <span class="o">=</span> <span class="n">cniTypesV1</span><span class="o">.</span><span class="n">Int</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="c">// The Endpoint must also be built synchronously</span>
	<span class="n">ep</span><span class="o">.</span><span class="n">SyncBuildEndpoint</span> <span class="o">=</span> <span class="no">true</span>
	<span class="n">err</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">EndpointCreate</span><span class="p">(</span><span class="n">ep</span><span class="p">)</span>  <span class="c">// Create the CiliumEndpoint</span>

	<span class="k">return</span> <span class="n">cniTypes</span><span class="o">.</span><span class="n">PrintResult</span><span class="p">(</span><span class="n">res</span><span class="p">,</span> <span class="n">n</span><span class="o">.</span><span class="n">CNIVersion</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Once all network interfaces are ready, the last step is to create the <a href="https://doc.crds.dev/github.com/cilium/cilium/cilium.io/CiliumEndpoint/v2@v1.14.0-snapshot.4">CiliumEndpoint</a> resource. This is again done through the cilium-agent client: <code class="language-plaintext highlighter-rouge">PutEndpointID</code> sends a PUT request to the <code class="language-plaintext highlighter-rouge">/endpoint/{id}</code> path of cilium-daemon. The Endpoint ID carried in the request is <code class="language-plaintext highlighter-rouge">cilium-local:0</code>, because <code class="language-plaintext highlighter-rouge">ep.ID</code> has not been assigned yet and still holds its zero value.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/client/endpoint.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">Client</span><span class="p">)</span> <span class="n">EndpointCreate</span><span class="p">(</span><span class="n">ep</span> <span class="o">*</span><span class="n">models</span><span class="o">.</span><span class="n">EndpointChangeRequest</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="n">id</span> <span class="o">:=</span> <span class="n">pkgEndpointID</span><span class="o">.</span><span class="n">NewCiliumID</span><span class="p">(</span><span class="n">ep</span><span class="o">.</span><span class="n">ID</span><span class="p">)</span>  <span class="c">// cilium-local:$id</span>
	<span class="n">params</span> <span class="o">:=</span> <span class="n">endpoint</span><span class="o">.</span><span class="n">NewPutEndpointIDParams</span><span class="p">()</span><span class="o">.</span><span class="n">WithID</span><span class="p">(</span><span class="n">id</span><span class="p">)</span><span class="o">.</span><span class="n">WithEndpoint</span><span class="p">(</span><span class="n">ep</span><span class="p">)</span><span class="o">.</span><span class="n">WithTimeout</span><span class="p">(</span><span class="n">api</span><span class="o">.</span><span class="n">ClientTimeout</span><span class="p">)</span>  <span class="c">// Build the request parameters</span>
	<span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">Endpoint</span><span class="o">.</span><span class="n">PutEndpointID</span><span class="p">(</span><span class="n">params</span><span class="p">)</span>
	<span class="k">return</span> <span class="n">Hint</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
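<p>The <code class="language-plaintext highlighter-rouge">cilium-local:$id</code> format noted in the comment above can be illustrated with a short sketch. It assumes the endpoint ID is a 64-bit integer rendered in decimal; the helper names <code class="language-plaintext highlighter-rouge">newCiliumID</code> and <code class="language-plaintext highlighter-rouge">splitID</code> are hypothetical and only loosely modeled on the call shown above.</p>

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ciliumLocalPrefix is the prefix shown in the snippet above.
const ciliumLocalPrefix = "cilium-local"

// newCiliumID formats a numeric endpoint ID as "cilium-local:<id>".
// Assumption of this sketch: the ID is an int64 printed in decimal.
func newCiliumID(id int64) string {
	return ciliumLocalPrefix + ":" + strconv.FormatInt(id, 10)
}

// splitID splits a "prefix:value" identifier back into its parts.
func splitID(s string) (string, string, error) {
	prefix, value, ok := strings.Cut(s, ":")
	if !ok {
		return "", "", fmt.Errorf("malformed endpoint ID %q", s)
	}
	return prefix, value, nil
}

func main() {
	var unassigned int64 // zero value, as for an ID that was never set
	id := newCiliumID(unassigned)
	prefix, value, err := splitID(id)
	fmt.Println(id, prefix, value, err)
}
```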
<p>The handler that cilium-daemon registers for the <code class="language-plaintext highlighter-rouge">/endpoint/{id}</code> path looks like this:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/cmd/endpoint.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">h</span> <span class="o">*</span><span class="n">putEndpointID</span><span class="p">)</span> <span class="n">Handle</span><span class="p">(</span><span class="n">params</span> <span class="n">PutEndpointIDParams</span><span class="p">)</span> <span class="p">(</span><span class="n">resp</span> <span class="n">middleware</span><span class="o">.</span><span class="n">Responder</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">epTemplate</span> <span class="o">:=</span> <span class="n">params</span><span class="o">.</span><span class="n">Endpoint</span>

	<span class="n">ep</span><span class="p">,</span> <span class="n">code</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">h</span><span class="o">.</span><span class="n">d</span><span class="o">.</span><span class="n">createEndpoint</span><span class="p">(</span><span class="n">params</span><span class="o">.</span><span class="n">HTTPRequest</span><span class="o">.</span><span class="n">Context</span><span class="p">(),</span> <span class="n">h</span><span class="o">.</span><span class="n">d</span><span class="p">,</span> <span class="n">epTemplate</span><span class="p">)</span>  <span class="c">// ***</span>

	<span class="k">return</span> <span class="n">NewPutEndpointIDCreated</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The main job of <code class="language-plaintext highlighter-rouge">createEndpoint</code> is to <strong>create an Endpoint from the contents of the request</strong>. Along the way it also performs several important steps:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">AddEndpoint</code>: allocates an ID for the Endpoint and starts a controller for each CiliumEndpoint CRD</li>
  <li><code class="language-plaintext highlighter-rouge">UpdateLabels</code>: derives the Endpoint's <a href="https://docs.cilium.io/en/stable/internals/security-identities/#security-identities">Security identities</a> from the Pod's labels</li>
  <li><code class="language-plaintext highlighter-rouge">Regenerate</code>: regenerates the eBPF programs and Network Policy</li>
</ul>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/cmd/endpoint.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">d</span> <span class="o">*</span><span class="n">Daemon</span><span class="p">)</span> <span class="n">createEndpoint</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">owner</span> <span class="n">regeneration</span><span class="o">.</span><span class="n">Owner</span><span class="p">,</span> <span class="n">epTemplate</span> <span class="o">*</span><span class="n">models</span><span class="o">.</span><span class="n">EndpointChangeRequest</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">endpoint</span><span class="o">.</span><span class="n">Endpoint</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>

	<span class="c">// Parse the request parameters and create the Endpoint</span>
	<span class="n">ep</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">endpoint</span><span class="o">.</span><span class="n">NewEndpointFromChangeModel</span><span class="p">(</span><span class="n">d</span><span class="o">.</span><span class="n">ctx</span><span class="p">,</span> <span class="n">owner</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">d</span><span class="o">.</span><span class="n">ipcache</span><span class="p">,</span> <span class="n">d</span><span class="o">.</span><span class="n">l7Proxy</span><span class="p">,</span> <span class="n">d</span><span class="o">.</span><span class="n">identityAllocator</span><span class="p">,</span> <span class="n">epTemplate</span><span class="p">)</span>

	<span class="c">// Check whether an Endpoint already exists for this Endpoint ID or Container</span>
	<span class="n">oldEp</span> <span class="o">:=</span> <span class="n">d</span><span class="o">.</span><span class="n">endpointManager</span><span class="o">.</span><span class="n">LookupCiliumID</span><span class="p">(</span><span class="n">ep</span><span class="o">.</span><span class="n">ID</span><span class="p">)</span>
	<span class="n">oldEp</span> <span class="o">=</span> <span class="n">d</span><span class="o">.</span><span class="n">endpointManager</span><span class="o">.</span><span class="n">LookupContainerID</span><span class="p">(</span><span class="n">ep</span><span class="o">.</span><span class="n">GetContainerID</span><span class="p">())</span>

	<span class="c">// Check whether the Endpoint IP address is already in use</span>
	<span class="k">var</span> <span class="n">checkIDs</span> <span class="p">[]</span><span class="kt">string</span>
	<span class="n">checkIDs</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">checkIDs</span><span class="p">,</span> <span class="n">endpointid</span><span class="o">.</span><span class="n">NewID</span><span class="p">(</span><span class="n">endpointid</span><span class="o">.</span><span class="n">IPv4Prefix</span><span class="p">,</span> <span class="n">ep</span><span class="o">.</span><span class="n">IPv4</span><span class="o">.</span><span class="n">String</span><span class="p">()))</span>  <span class="c">// $prefix:$ip</span>
	<span class="c">// ... also for ipv6</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">id</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">checkIDs</span> <span class="p">{</span>
		<span class="n">oldEp</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">d</span><span class="o">.</span><span class="n">endpointManager</span><span class="o">.</span><span class="n">Lookup</span><span class="p">(</span><span class="n">id</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="o">||</span> <span class="n">oldEp</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span> <span class="n">err</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="n">addLabels</span> <span class="o">:=</span> <span class="n">labels</span><span class="o">.</span><span class="n">NewLabelsFromModel</span><span class="p">(</span><span class="n">epTemplate</span><span class="o">.</span><span class="n">Labels</span><span class="p">)</span>
	<span class="n">infoLabels</span> <span class="o">:=</span> <span class="n">labels</span><span class="o">.</span><span class="n">NewLabelsFromModel</span><span class="p">([]</span><span class="kt">string</span><span class="p">{})</span>

	<span class="n">err</span> <span class="o">=</span> <span class="n">d</span><span class="o">.</span><span class="n">endpointManager</span><span class="o">.</span><span class="n">AddEndpoint</span><span class="p">(</span><span class="n">owner</span><span class="p">,</span> <span class="n">ep</span><span class="p">,</span> <span class="s">"Create endpoint from API PUT"</span><span class="p">)</span>  <span class="c">// ***</span>

	<span class="n">regenTriggered</span> <span class="o">:=</span> <span class="n">ep</span><span class="o">.</span><span class="n">UpdateLabels</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">addLabels</span><span class="p">,</span> <span class="n">infoLabels</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span>  <span class="c">// ***</span>
	<span class="k">if</span> <span class="o">!</span><span class="n">regenTriggered</span> <span class="p">{</span>
		<span class="n">regenMetadata</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">regeneration</span><span class="o">.</span><span class="n">ExternalRegenerationMetadata</span><span class="p">{</span>
			<span class="n">RegenerationLevel</span><span class="o">:</span> <span class="n">regeneration</span><span class="o">.</span><span class="n">RegenerateWithDatapathRewrite</span><span class="p">,</span>
			<span class="c">// ...</span>
		<span class="p">}</span>
		<span class="n">build</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">ep</span><span class="o">.</span><span class="n">SetRegenerateStateIfAlive</span><span class="p">(</span><span class="n">regenMetadata</span><span class="p">)</span>

		<span class="k">if</span> <span class="n">build</span> <span class="p">{</span>
			<span class="n">ep</span><span class="o">.</span><span class="n">Regenerate</span><span class="p">(</span><span class="n">regenMetadata</span><span class="p">)</span>  <span class="c">// ***</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">ep</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<h4 id="addendpoint">AddEndpoint</h4>
<p>The call path of this function is shown below. After allocating an ID for the Endpoint, Cilium starts a controller <strong>for each</strong> CiliumEndpoint (CEP) custom resource to sync data from the current Endpoint.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|- AddEndpoint                                         @ pkg/endpointmanager/manager.go#L605
   |- endpointManager.expose
      |- AllocateID
      |- EndpointSynchronizer.RunK8sCiliumEndpointSync @ pkg/k8s/watchers/endpointsynchronizer.go#L49
</code></pre></div></div>
<p>The CiliumEndpoint controller is implemented as follows (abridged); each controller's reconciliation runs at a 10s interval:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/k8s/watchers/endpointsynchronizer.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">epSync</span> <span class="o">*</span><span class="n">EndpointSynchronizer</span><span class="p">)</span> <span class="n">RunK8sCiliumEndpointSync</span><span class="p">(</span><span class="n">e</span> <span class="o">*</span><span class="n">endpoint</span><span class="o">.</span><span class="n">Endpoint</span><span class="p">,</span> <span class="n">conf</span> <span class="n">endpoint</span><span class="o">.</span><span class="n">EndpointStatusConfiguration</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">var</span> <span class="p">(</span>
		<span class="n">endpointID</span>     <span class="o">=</span> <span class="n">e</span><span class="o">.</span><span class="n">ID</span>
		<span class="n">controllerName</span> <span class="o">=</span> <span class="n">endpoint</span><span class="o">.</span><span class="n">EndpointSyncControllerName</span><span class="p">(</span><span class="n">endpointID</span><span class="p">)</span>
	<span class="p">)</span>
	<span class="n">ciliumClient</span> <span class="o">:=</span> <span class="n">epSync</span><span class="o">.</span><span class="n">Clientset</span><span class="o">.</span><span class="n">CiliumV2</span><span class="p">()</span>

	<span class="k">var</span> <span class="p">(</span>
		<span class="n">localCEP</span> <span class="o">*</span><span class="n">cilium_v2</span><span class="o">.</span><span class="n">CiliumEndpoint</span> <span class="c">// local copy of the CEP object, reused across runs</span>
		<span class="n">needInit</span> <span class="o">=</span> <span class="no">true</span>                    <span class="c">// needInit indicates that the CEP may still need to be created</span>
		<span class="n">firstTry</span> <span class="o">=</span> <span class="no">true</span>                    <span class="c">// on the first try, get the CEP object from the k8s cache</span>
	<span class="p">)</span>

	<span class="n">e</span><span class="o">.</span><span class="n">UpdateController</span><span class="p">(</span><span class="n">controllerName</span><span class="p">,</span>
		<span class="n">controller</span><span class="o">.</span><span class="n">ControllerParams</span><span class="p">{</span>
			<span class="n">RunInterval</span><span class="o">:</span> <span class="m">10</span> <span class="o">*</span> <span class="n">time</span><span class="o">.</span><span class="n">Second</span><span class="p">,</span>
			<span class="n">DoFunc</span><span class="o">:</span> <span class="k">func</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="p">(</span><span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
				<span class="n">podName</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">GetK8sPodName</span><span class="p">()</span>
				<span class="n">namespace</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">GetK8sNamespace</span><span class="p">()</span>

				<span class="k">if</span> <span class="n">needInit</span> <span class="p">{</span>
					<span class="k">if</span> <span class="n">firstTry</span> <span class="p">{</span>
						<span class="c">// First try to get the CEP object from the API server cache</span>
						<span class="n">localCEP</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">ciliumClient</span><span class="o">.</span><span class="n">CiliumEndpoints</span><span class="p">(</span><span class="n">namespace</span><span class="p">)</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">podName</span><span class="p">,</span> <span class="n">meta_v1</span><span class="o">.</span><span class="n">GetOptions</span><span class="p">{</span><span class="n">ResourceVersion</span><span class="o">:</span> <span class="s">"0"</span><span class="p">})</span>
						<span class="n">firstTry</span> <span class="o">=</span> <span class="no">false</span>
					<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
						<span class="n">localCEP</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">ciliumClient</span><span class="o">.</span><span class="n">CiliumEndpoints</span><span class="p">(</span><span class="n">namespace</span><span class="p">)</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">podName</span><span class="p">,</span> <span class="n">meta_v1</span><span class="o">.</span><span class="n">GetOptions</span><span class="p">{})</span>
					<span class="p">}</span>

					<span class="k">switch</span> <span class="p">{</span>
					<span class="k">case</span> <span class="n">k8serrors</span><span class="o">.</span><span class="n">IsNotFound</span><span class="p">(</span><span class="n">err</span><span class="p">)</span><span class="o">:</span>  <span class="c">// If the CEP object does not exist, create a new one</span>
						<span class="n">pod</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">GetPod</span><span class="p">()</span>
						<span class="n">cep</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">cilium_v2</span><span class="o">.</span><span class="n">CiliumEndpoint</span><span class="p">{</span>  <span class="c">// Initialize the new CEP object</span>
							<span class="n">ObjectMeta</span><span class="o">:</span> <span class="n">meta_v1</span><span class="o">.</span><span class="n">ObjectMeta</span><span class="p">{</span>
								<span class="n">Name</span><span class="o">:</span> <span class="n">podName</span><span class="p">,</span>  <span class="c">// The CEP object has the same name as the Pod</span>
								<span class="n">OwnerReferences</span><span class="o">:</span> <span class="p">[]</span><span class="n">meta_v1</span><span class="o">.</span><span class="n">OwnerReference</span><span class="p">{</span>  <span class="c">// Its owner is the Pod that backs the Endpoint</span>
									<span class="p">{</span>
										<span class="n">APIVersion</span><span class="o">:</span> <span class="s">"v1"</span><span class="p">,</span>
										<span class="n">Kind</span><span class="o">:</span>       <span class="s">"Pod"</span><span class="p">,</span>
										<span class="n">Name</span><span class="o">:</span>       <span class="n">pod</span><span class="o">.</span><span class="n">GetObjectMeta</span><span class="p">()</span><span class="o">.</span><span class="n">GetName</span><span class="p">(),</span>
										<span class="n">UID</span><span class="o">:</span>        <span class="n">pod</span><span class="o">.</span><span class="n">ObjectMeta</span><span class="o">.</span><span class="n">UID</span><span class="p">,</span>
									<span class="p">},</span>
								<span class="p">},</span>
								<span class="n">Labels</span><span class="o">:</span> <span class="n">pod</span><span class="o">.</span><span class="n">GetObjectMeta</span><span class="p">()</span><span class="o">.</span><span class="n">GetLabels</span><span class="p">(),</span>
							<span class="p">},</span>
							<span class="n">Status</span><span class="o">:</span> <span class="o">*</span><span class="n">mdl</span><span class="p">,</span>
						<span class="p">}</span>
						<span class="n">localCEP</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">ciliumClient</span><span class="o">.</span><span class="n">CiliumEndpoints</span><span class="p">(</span><span class="n">namespace</span><span class="p">)</span><span class="o">.</span><span class="n">Create</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">cep</span><span class="p">,</span> <span class="n">meta_v1</span><span class="o">.</span><span class="n">CreateOptions</span><span class="p">{})</span>  <span class="c">// Create the CEP object</span>
					<span class="k">default</span><span class="o">:</span>
						<span class="k">return</span> <span class="n">err</span>
					<span class="p">}</span>

					<span class="n">needInit</span> <span class="o">=</span> <span class="no">false</span>
				<span class="p">}</span>

				<span class="c">// If localCEP is nil, first try to fetch the latest CEP object from the API server</span>
				<span class="k">if</span> <span class="n">localCEP</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
					<span class="n">localCEP</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">ciliumClient</span><span class="o">.</span><span class="n">CiliumEndpoints</span><span class="p">(</span><span class="n">namespace</span><span class="p">)</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">podName</span><span class="p">,</span> <span class="n">meta_v1</span><span class="o">.</span><span class="n">GetOptions</span><span class="p">{})</span>
					<span class="k">switch</span> <span class="p">{</span>
					<span class="c">// If it is not found, the CEP has not been created yet; set the flag so that it is created on the next reconciliation</span>
					<span class="k">case</span> <span class="n">k8serrors</span><span class="o">.</span><span class="n">IsNotFound</span><span class="p">(</span><span class="n">err</span><span class="p">)</span> <span class="o">||</span> <span class="n">k8serrors</span><span class="o">.</span><span class="n">IsInvalid</span><span class="p">(</span><span class="n">err</span><span class="p">)</span><span class="o">:</span>
						<span class="n">needInit</span> <span class="o">=</span> <span class="no">true</span>
						<span class="k">return</span> <span class="n">err</span>
					<span class="p">}</span>
				<span class="p">}</span>
			<span class="p">},</span>
			<span class="n">StopFunc</span><span class="o">:</span> <span class="k">func</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
				<span class="k">return</span> <span class="n">deleteCEP</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">scopedLog</span><span class="p">,</span> <span class="n">ciliumClient</span><span class="p">,</span> <span class="n">e</span><span class="p">)</span>  <span class="c">// Deletes the CEP directly via ciliumClient.CiliumEndpoints(namespace).Delete</span>
			<span class="p">},</span>
		<span class="p">})</span>
<span class="p">}</span>
</code></pre></div></div>
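<p>The get-or-create flow in this controller can be condensed into a small self-contained sketch. Everything in it is hypothetical and exists only for illustration: <code class="language-plaintext highlighter-rouge">fakeStore</code> stands in for the API server, <code class="language-plaintext highlighter-rouge">syncer</code> for the state captured by the closure, and <code class="language-plaintext highlighter-rouge">errNotFound</code> for the <code class="language-plaintext highlighter-rouge">k8serrors.IsNotFound</code> check. It is not Cilium's implementation.</p>

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound plays the role of a Kubernetes NotFound error (hypothetical).
var errNotFound = errors.New("not found")

// fakeStore is a hypothetical in-memory stand-in for the API server.
type fakeStore struct {
	objects map[string]string
}

func (s *fakeStore) get(name string) (string, error) {
	obj, ok := s.objects[name]
	if !ok {
		return "", errNotFound
	}
	return obj, nil
}

func (s *fakeStore) create(name, obj string) {
	s.objects[name] = obj
}

// syncer keeps the state that the closure in the excerpt captures.
type syncer struct {
	store    *fakeStore
	name     string
	needInit bool
	created  int // number of create calls, kept for demonstration only
}

// reconcile is one run of the controller function: fetch the object and
// create it only when it does not exist yet.
func (s *syncer) reconcile() error {
	if !s.needInit {
		return nil
	}
	_, err := s.store.get(s.name)
	switch {
	case err == nil:
		// The object already exists; nothing to create.
	case errors.Is(err, errNotFound):
		s.store.create(s.name, "status of "+s.name)
		s.created++
	default:
		return err // left for the next run to retry
	}
	s.needInit = false
	return nil
}

func newSyncer(name string) *syncer {
	return &syncer{
		store:    &fakeStore{objects: map[string]string{}},
		name:     name,
		needInit: true,
	}
}

func main() {
	s := newSyncer("pod-a")
	// A real controller would be driven by a timer (10s in the excerpt).
	for i := 0; i < 3; i++ {
		if err := s.reconcile(); err != nil {
			fmt.Println("reconcile failed:", err)
		}
	}
	fmt.Println("creates:", s.created)
}
```

<p>Because <code class="language-plaintext highlighter-rouge">needInit</code> is cleared after the first successful run, repeated reconciliations do not create the object again.</p>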
<h4 id="updatelabels">UpdateLabels</h4>
<p>In cilium-daemon, a Pod's labels are split into two types: <code class="language-plaintext highlighter-rouge">identityLabels</code> and <code class="language-plaintext highlighter-rouge">informationLabels</code>, held in the variables <code class="language-plaintext highlighter-rouge">addLabels</code> and <code class="language-plaintext highlighter-rouge">infoLabels</code> respectively. Only the former carries the <code class="language-plaintext highlighter-rouge">identityLabels</code>, that is, the labels that determine the Endpoint's security identity. For how the labels are split, see the <a href="https://github.com/cilium/cilium/blob/29211d8d1742d4c7fcabe2a79dddc521f30e2ffb/pkg/labelsfilter/filter.go#L253">labelPrefixCfg.filterLabels</a> method.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/endpoint/endpoint.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">e</span> <span class="o">*</span><span class="n">Endpoint</span><span class="p">)</span> <span class="n">UpdateLabels</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">identityLabels</span><span class="p">,</span> <span class="n">infoLabels</span> <span class="n">labels</span><span class="o">.</span><span class="n">Labels</span><span class="p">,</span> <span class="n">blocking</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">(</span><span class="n">regenTriggered</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">{</span>
	<span class="c">// Replace the endpoint's information labels</span>
	<span class="n">e</span><span class="o">.</span><span class="n">replaceInformationLabels</span><span class="p">(</span><span class="n">infoLabels</span><span class="p">)</span>
	<span class="c">// Replace the identity labels; returns the identityRevision if a net change occurred, otherwise 0</span>
	<span class="n">rev</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">replaceIdentityLabels</span><span class="p">(</span><span class="n">identityLabels</span><span class="p">)</span>
	<span class="n">e</span><span class="o">.</span><span class="n">unlock</span><span class="p">()</span>
	<span class="k">if</span> <span class="n">rev</span> <span class="o">!=</span> <span class="m">0</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">e</span><span class="o">.</span><span class="n">runIdentityResolver</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">rev</span><span class="p">,</span> <span class="n">blocking</span><span class="p">)</span>  <span class="c">// The identity labels changed, so resolve the identity again</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="no">false</span>
<span class="p">}</span>
</code></pre></div></div>
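<p>To make the identity/information split more concrete, here is a minimal sketch that partitions a label set by key prefix. The prefix rule, the function name <code class="language-plaintext highlighter-rouge">splitLabels</code> and the sample labels are assumptions made for illustration; the real rules live in the <code class="language-plaintext highlighter-rouge">filterLabels</code> method linked above and are more involved.</p>

```go
package main

import (
	"fmt"
	"strings"
)

// splitLabels partitions labels into identity and information labels.
// A label counts as an identity label when its key starts with one of
// identityPrefixes; this prefix rule is an assumption made for the sketch.
func splitLabels(all map[string]string, identityPrefixes []string) (identity, info map[string]string) {
	identity = map[string]string{}
	info = map[string]string{}
	for k, v := range all {
		matched := false
		for _, p := range identityPrefixes {
			if strings.HasPrefix(k, p) {
				matched = true
				break
			}
		}
		if matched {
			identity[k] = v
		} else {
			info[k] = v
		}
	}
	return identity, info
}

func main() {
	podLabels := map[string]string{
		"app":               "web",
		"pod-template-hash": "5d4f8c",
	}
	identity, info := splitLabels(podLabels, []string{"app"})
	fmt.Println("identity labels:", identity)
	fmt.Println("information labels:", info)
}
```

<p>In this sketch only the labels in the first map would feed into the security identity; the second map is kept purely as metadata.</p>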
<p>A security identity changes only when the <code class="language-plaintext highlighter-rouge">identityLabels</code> change. The call stack of the <code class="language-plaintext highlighter-rouge">runIdentityResolver</code> method is shown below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|- Endpoint.runIdentityResolver                    @ pkg/endpoint/endpoint.go
   |- Endpoint.identityLabelsChanged
      |- CachingIdentityAllocator.AllocateIdentity @ pkg/identity/cache/allocator.go
         |- Allocator.Allocate                     @ pkg/allocator/allocator.go
      |- Endpoint.SetIdentity                      @ pkg/endpoint/policy.go
         |- Endpoint.runIPIdentitySync
            |- UpsertIPToKVStore                   @ pkg/ipcache/kvstore.go
      |- Endpoint.forcePolicyComputation
</code></pre></div></div>
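<p>Because <strong>security identities are a cluster-wide concept</strong>, that is, every security identity is unique within the cluster, identities have to be allocated by a component that is global to the cluster. The <code class="language-plaintext highlighter-rouge">Allocate</code> method shows that the kvstore (i.e. etcd) takes this role. <code class="language-plaintext highlighter-rouge">Allocate</code> first looks up the provided key in the kvstore; if no ID has been allocated for it yet, a new ID is created for that key. If the allocation fails it is retried, for up to <code class="language-plaintext highlighter-rouge">maxAllocAttempts</code> attempts in total.</p>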
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/allocator/allocator.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">Allocator</span><span class="p">)</span> <span class="n">Allocate</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">key</span> <span class="n">AllocatorKey</span><span class="p">)</span> <span class="p">(</span><span class="n">idpool</span><span class="o">.</span><span class="n">ID</span><span class="p">,</span> <span class="kt">bool</span><span class="p">,</span> <span class="kt">bool</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>

	<span class="k">for</span> <span class="n">attempt</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">attempt</span> <span class="o">&lt;</span> <span class="n">maxAllocAttempts</span><span class="p">;</span> <span class="n">attempt</span><span class="o">++</span> <span class="p">{</span>  <span class="c">// maxAllocAttempts is fixed at 16</span>
		<span class="k">if</span> <span class="n">val</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">localKeys</span><span class="o">.</span><span class="n">use</span><span class="p">(</span><span class="n">key</span><span class="o">.</span><span class="n">GetKey</span><span class="p">());</span> <span class="n">val</span> <span class="o">!=</span> <span class="n">idpool</span><span class="o">.</span><span class="n">NoID</span> <span class="p">{</span>  <span class="c">// idpool.NoID (0) means the ID does not exist</span>
			<span class="n">a</span><span class="o">.</span><span class="n">mainCache</span><span class="o">.</span><span class="n">insert</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span>
			<span class="k">return</span> <span class="n">val</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="no">nil</span>  <span class="c">// The second return value reports whether a new ID was created in the kvstore</span>
		<span class="p">}</span>

		<span class="n">value</span><span class="p">,</span> <span class="n">isNew</span><span class="p">,</span> <span class="n">firstUse</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">a</span><span class="o">.</span><span class="n">lockedAllocate</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">key</span><span class="p">)</span>  <span class="c">// Allocate a new ID</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">a</span><span class="o">.</span><span class="n">mainCache</span><span class="o">.</span><span class="n">insert</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span>
			<span class="k">return</span> <span class="n">value</span><span class="p">,</span> <span class="n">isNew</span><span class="p">,</span> <span class="n">firstUse</span><span class="p">,</span> <span class="no">nil</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="m">0</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="n">err</span>
<span class="p">}</span>
</code></pre></div></div>
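<p>The shape of this allocation loop (local cache first, then a shared store, with a bounded number of attempts) can be sketched without any Cilium dependency. The types <code class="language-plaintext highlighter-rouge">sharedStore</code> and <code class="language-plaintext highlighter-rouge">allocator</code>, the <code class="language-plaintext highlighter-rouge">failFirst</code> knob and the sequential ID scheme are all hypothetical; only the limit of 16 attempts is taken from the excerpt.</p>

```go
package main

import (
	"errors"
	"fmt"
)

const maxAttempts = 16 // mirrors maxAllocAttempts in the excerpt

var errConflict = errors.New("conflict, try again")

// sharedStore is a hypothetical stand-in for the cluster-wide kvstore.
type sharedStore struct {
	ids       map[string]int
	nextID    int
	failFirst int // number of initial calls that fail, to exercise retries
}

// getOrCreate returns the ID for key, creating one if none exists yet.
func (s *sharedStore) getOrCreate(key string) (int, bool, error) {
	if s.failFirst > 0 {
		s.failFirst--
		return 0, false, errConflict
	}
	if existing, ok := s.ids[key]; ok {
		return existing, false, nil
	}
	s.nextID++
	s.ids[key] = s.nextID
	return s.nextID, true, nil
}

type allocator struct {
	local map[string]int // IDs already in use on this node
	store *sharedStore
}

// allocate checks the local cache first, then falls back to the shared
// store, trying at most maxAttempts times before giving up.
func (a *allocator) allocate(key string) (int, bool, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if cached, ok := a.local[key]; ok {
			return cached, false, nil
		}
		id, isNew, err := a.store.getOrCreate(key)
		if err == nil {
			a.local[key] = id
			return id, isNew, nil
		}
		lastErr = err
	}
	return 0, false, lastErr
}

func newAllocator(failFirst int) *allocator {
	return &allocator{
		local: map[string]int{},
		store: &sharedStore{ids: map[string]int{}, failFirst: failFirst},
	}
}

func main() {
	a := newAllocator(2) // the first two store calls fail
	id, isNew, err := a.allocate("app=web")
	fmt.Println(id, isNew, err)
	id, isNew, err = a.allocate("app=web")
	fmt.Println(id, isNew, err)
}
```

<p>The second call is served from the local cache, so it never reaches the shared store and reports that no new ID was created.</p>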
<p>Once the Endpoint's identity has been computed, cilium-daemon goes on to call <code class="language-plaintext highlighter-rouge">UpsertIPToKVStore</code> to update or insert the IP-&gt;Identity mapping in the kvstore:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/ipcache/kvstore.go</span>

<span class="k">func</span> <span class="n">UpsertIPToKVStore</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">IP</span><span class="p">,</span> <span class="n">hostIP</span> <span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">,</span> <span class="n">ID</span> <span class="n">identity</span><span class="o">.</span><span class="n">NumericIdentity</span><span class="p">,</span> <span class="n">key</span> <span class="kt">uint8</span><span class="p">,</span> <span class="n">metadata</span><span class="p">,</span> <span class="n">k8sNamespace</span><span class="p">,</span> <span class="n">k8sPodName</span> <span class="kt">string</span><span class="p">,</span> <span class="n">npm</span> <span class="n">types</span><span class="o">.</span><span class="n">NamedPortMap</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="c">// Sort the named ports lexicographically by name</span>
	<span class="n">namedPorts</span> <span class="o">:=</span> <span class="c">// ...</span>

	<span class="n">ipKey</span> <span class="o">:=</span> <span class="n">path</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">IPIdentitiesPath</span><span class="p">,</span>  <span class="c">// =&gt; "cilium/state/ip/v1"</span>
                       <span class="n">AddressSpace</span><span class="p">,</span> <span class="n">IP</span><span class="o">.</span><span class="n">String</span><span class="p">())</span>
	<span class="n">ipIDPair</span> <span class="o">:=</span> <span class="n">identity</span><span class="o">.</span><span class="n">IPIdentityPair</span><span class="p">{</span>
		<span class="n">IP</span><span class="o">:</span>           <span class="n">IP</span><span class="p">,</span>
		<span class="n">ID</span><span class="o">:</span>           <span class="n">ID</span><span class="p">,</span>
		<span class="c">// ...</span>
		<span class="n">NamedPorts</span><span class="o">:</span>   <span class="n">namedPorts</span><span class="p">,</span>
	<span class="p">}</span>

	<span class="n">marshaledIPIDPair</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">Marshal</span><span class="p">(</span><span class="n">ipIDPair</span><span class="p">)</span>

	<span class="n">err</span> <span class="o">=</span> <span class="n">globalMap</span><span class="o">.</span><span class="n">store</span><span class="o">.</span><span class="n">upsert</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ipKey</span><span class="p">,</span> <span class="kt">string</span><span class="p">(</span><span class="n">marshaledIPIDPair</span><span class="p">),</span> <span class="no">true</span><span class="p">)</span>  <span class="c">// update/insert</span>
	<span class="k">return</span> <span class="n">err</span>
<span class="p">}</span>
</code></pre></div></div>
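<p>As a rough illustration of what ends up in the kvstore, the sketch below builds the key with <code class="language-plaintext highlighter-rouge">path.Join</code> and stores a JSON value in a plain map. The map, the <code class="language-plaintext highlighter-rouge">upsert</code> helper, the reduced <code class="language-plaintext highlighter-rouge">ipIdentityPair</code> struct and the <code class="language-plaintext highlighter-rouge">"default"</code> address space are assumptions for this example; only the base path comes from the comment in the excerpt.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
)

// Hypothetical constants; the base path follows the comment in the excerpt,
// the address space value is an assumption.
const (
	ipIdentitiesPath = "cilium/state/ip/v1"
	addressSpace     = "default"
)

// ipIdentityPair is a simplified, hypothetical version of the stored value.
type ipIdentityPair struct {
	IP string `json:"IP"`
	ID uint32 `json:"ID"`
}

// upsert writes the IP-to-identity mapping into store, replacing any
// previous value for the same IP, and returns the key that was written.
func upsert(store map[string]string, ip string, id uint32) (string, error) {
	key := path.Join(ipIdentitiesPath, addressSpace, ip)
	value, err := json.Marshal(ipIdentityPair{IP: ip, ID: id})
	if err != nil {
		return "", err
	}
	store[key] = string(value)
	return key, nil
}

func main() {
	store := map[string]string{}
	key, err := upsert(store, "10.0.0.7", 1234)
	if err != nil {
		fmt.Println("upsert failed:", err)
		return
	}
	fmt.Println(key, store[key])
}
```

<p>Keying by IP means that a later upsert for the same address overwrites the earlier entry instead of adding a second one.</p>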
<h4 id="regenerate">Regenerate</h4>
<p>When the <code class="language-plaintext highlighter-rouge">identityLabels</code> change, the security identity is not the only thing that is regenerated: so are <strong>the eBPF program and the Network Policy of that Endpoint</strong>. In the <code class="language-plaintext highlighter-rouge">ep.Regenerate</code> method, cilium-daemon models the regeneration as an event and adds it to an event queue:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/endpoint/policy.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">e</span> <span class="o">*</span><span class="n">Endpoint</span><span class="p">)</span> <span class="n">Regenerate</span><span class="p">(</span><span class="n">regenMetadata</span> <span class="o">*</span><span class="n">regeneration</span><span class="o">.</span><span class="n">ExternalRegenerationMetadata</span><span class="p">)</span> <span class="o">&lt;-</span><span class="k">chan</span> <span class="kt">bool</span> <span class="p">{</span>
	<span class="n">done</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">(</span><span class="k">chan</span> <span class="kt">bool</span><span class="p">,</span> <span class="m">1</span><span class="p">)</span>

	<span class="n">regenContext</span> <span class="o">:=</span> <span class="n">ParseExternalRegenerationMetadata</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">cFunc</span><span class="p">,</span> <span class="n">regenMetadata</span><span class="p">)</span>
	<span class="n">epEvent</span> <span class="o">:=</span> <span class="n">eventqueue</span><span class="o">.</span><span class="n">NewEvent</span><span class="p">(</span><span class="o">&amp;</span><span class="n">EndpointRegenerationEvent</span><span class="p">{</span>  <span class="c">// Create the regeneration (regen) event</span>
		<span class="n">regenContext</span><span class="o">:</span> <span class="n">regenContext</span><span class="p">,</span>
		<span class="n">ep</span><span class="o">:</span>           <span class="n">e</span><span class="p">,</span>
	<span class="p">})</span>

	<span class="n">resChan</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">eventQueue</span><span class="o">.</span><span class="n">Enqueue</span><span class="p">(</span><span class="n">epEvent</span><span class="p">)</span>  <span class="c">// Add the regen event to the event queue</span>

	<span class="k">go</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
		<span class="k">select</span> <span class="p">{</span>
		<span class="k">case</span> <span class="n">result</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="o">&lt;-</span><span class="n">resChan</span><span class="o">:</span>
			<span class="k">if</span> <span class="n">ok</span> <span class="p">{</span>
				<span class="n">regenResult</span> <span class="o">:=</span> <span class="n">result</span><span class="o">.</span><span class="p">(</span><span class="o">*</span><span class="n">EndpointRegenerationResult</span><span class="p">)</span>  <span class="c">// Use the result of the regen event to decide whether the build succeeded</span>
				<span class="n">buildSuccess</span> <span class="o">=</span> <span class="n">regenResult</span><span class="o">.</span><span class="n">err</span> <span class="o">==</span> <span class="no">nil</span>
			<span class="p">}</span>
		<span class="p">}</span>
		<span class="n">done</span> <span class="o">&lt;-</span> <span class="n">buildSuccess</span>
		<span class="nb">close</span><span class="p">(</span><span class="n">done</span><span class="p">)</span>
	<span class="p">}()</span>

	<span class="k">return</span> <span class="n">done</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Cilium consumes the events where the event queue is run; each event type implements the method defined by the <code class="language-plaintext highlighter-rouge">EventHandler</code> interface:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/eventqueue/eventqueue.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">q</span> <span class="o">*</span><span class="n">EventQueue</span><span class="p">)</span> <span class="n">Run</span><span class="p">()</span> <span class="p">{</span>  <span class="c">// Event consumption</span>
	<span class="k">go</span> <span class="n">q</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">q</span> <span class="o">*</span><span class="n">EventQueue</span><span class="p">)</span> <span class="n">run</span><span class="p">()</span> <span class="p">{</span>
	<span class="n">q</span><span class="o">.</span><span class="n">eventQueueOnce</span><span class="o">.</span><span class="n">Do</span><span class="p">(</span><span class="k">func</span><span class="p">()</span> <span class="p">{</span>
		<span class="k">defer</span> <span class="nb">close</span><span class="p">(</span><span class="n">q</span><span class="o">.</span><span class="n">eventsClosed</span><span class="p">)</span>
		<span class="k">for</span> <span class="n">ev</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">q</span><span class="o">.</span><span class="n">events</span> <span class="p">{</span>
			<span class="k">select</span> <span class="p">{</span>
			<span class="k">default</span><span class="o">:</span>
				<span class="n">ev</span><span class="o">.</span><span class="n">Metadata</span><span class="o">.</span><span class="n">Handle</span><span class="p">(</span><span class="n">ev</span><span class="o">.</span><span class="n">eventResults</span><span class="p">)</span>  <span class="c">// Handle the event</span>
				<span class="nb">close</span><span class="p">(</span><span class="n">ev</span><span class="o">.</span><span class="n">eventResults</span><span class="p">)</span>
			<span class="p">}</span>
		<span class="p">}</span>
	<span class="p">})</span>
<span class="p">}</span>

<span class="k">type</span> <span class="n">EventHandler</span> <span class="k">interface</span> <span class="p">{</span>
	<span class="n">Handle</span><span class="p">(</span><span class="k">chan</span> <span class="k">interface</span><span class="p">{})</span>
<span class="p">}</span>
</code></pre></div></div>
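<p>The enqueue/handle/result round trip described above can be reproduced in a few lines. This is a simplified sketch, not Cilium's <code class="language-plaintext highlighter-rouge">eventqueue</code> package: <code class="language-plaintext highlighter-rouge">queue</code>, <code class="language-plaintext highlighter-rouge">event</code>, <code class="language-plaintext highlighter-rouge">enqueue</code> and <code class="language-plaintext highlighter-rouge">regenEvent</code> are hypothetical names, and only the <code class="language-plaintext highlighter-rouge">Handle(chan interface{})</code> signature is taken from the excerpt.</p>

```go
package main

import "fmt"

// handler mirrors the EventHandler interface from the excerpt.
type handler interface {
	Handle(results chan interface{})
}

// event pairs a handler with the channel its result is delivered on.
type event struct {
	h       handler
	results chan interface{}
}

// queue is a hypothetical, minimal event queue with a single consumer.
type queue struct {
	events chan *event
}

func newQueue(size int) *queue {
	q := &queue{events: make(chan *event, size)}
	go q.run()
	return q
}

// run consumes events in order; closing results tells the producer that
// no further values will arrive.
func (q *queue) run() {
	for ev := range q.events {
		ev.h.Handle(ev.results)
		close(ev.results)
	}
}

// enqueue submits a handler and returns the channel to read its result from.
func (q *queue) enqueue(h handler) <-chan interface{} {
	ev := &event{h: h, results: make(chan interface{}, 1)}
	q.events <- ev
	return ev.results
}

// regenEvent is a stand-in for EndpointRegenerationEvent.
type regenEvent struct{ endpointID int }

func (r regenEvent) Handle(results chan interface{}) {
	results <- fmt.Sprintf("endpoint %d regenerated", r.endpointID)
}

func main() {
	q := newQueue(8)
	res := q.enqueue(regenEvent{endpointID: 42})
	if msg, ok := <-res; ok {
		fmt.Println(msg)
	}
}
```

<p>The result channel is buffered with capacity one so that the handler never blocks on a caller that has stopped listening; the consumer closes it after every event, which is why the caller checks <code class="language-plaintext highlighter-rouge">ok</code> when receiving.</p>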
<p><strong>Generating the eBPF program is really just a series of file operations</strong>. The <code class="language-plaintext highlighter-rouge">Handle</code> method defined by <code class="language-plaintext highlighter-rouge">EndpointRegenerationEvent</code> ends by calling the <code class="language-plaintext highlighter-rouge">Endpoint.regenerate</code> method, which first obtains two directories: <code class="language-plaintext highlighter-rouge">State</code> and <code class="language-plaintext highlighter-rouge">Next</code>. The latter is a temporary directory that is created and then removed during each regeneration; the former is set by the cilium-daemon configuration and by default lives under <code class="language-plaintext highlighter-rouge">/var/run/cilium</code>.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/endpoint/policy.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">e</span> <span class="o">*</span><span class="n">Endpoint</span><span class="p">)</span> <span class="n">regenerate</span><span class="p">(</span><span class="n">ctx</span> <span class="o">*</span><span class="n">regenerationContext</span><span class="p">)</span> <span class="p">(</span><span class="n">retErr</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>

	<span class="n">origDir</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">StateDirectoryPath</span><span class="p">()</span>
	<span class="n">ctx</span><span class="o">.</span><span class="n">datapathRegenerationContext</span><span class="o">.</span><span class="n">currentDir</span> <span class="o">=</span> <span class="n">origDir</span>  <span class="c">// $(daemonConfig.StateDir)/$(ep.StringID)</span>

	<span class="c">// The temporary directory holds the generated header files</span>
	<span class="n">tmpDir</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">NextDirectoryPath</span><span class="p">()</span>
	<span class="n">ctx</span><span class="o">.</span><span class="n">datapathRegenerationContext</span><span class="o">.</span><span class="n">nextDir</span> <span class="o">=</span> <span class="n">tmpDir</span>  <span class="c">// ./$(ep.StringID)_next</span>

	<span class="c">// Remove any existing temporary directory</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">removeDirectory</span><span class="p">(</span><span class="n">tmpDir</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">os</span><span class="o">.</span><span class="n">IsNotExist</span><span class="p">(</span><span class="n">err</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">return</span> <span class="c">// err</span>
	<span class="p">}</span>

	<span class="c">// create the temporary directory</span>
	<span class="n">err</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">MkdirAll</span><span class="p">(</span><span class="n">tmpDir</span><span class="p">,</span> <span class="m">0777</span><span class="p">)</span>

	<span class="k">defer</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
		<span class="n">e</span><span class="o">.</span><span class="n">removeDirectory</span><span class="p">(</span><span class="n">tmpDir</span><span class="p">)</span>
	<span class="p">}()</span>

	<span class="n">revision</span><span class="p">,</span> <span class="n">stateDirComplete</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">e</span><span class="o">.</span><span class="n">regenerateBPF</span><span class="p">(</span><span class="n">ctx</span><span class="p">)</span>  <span class="c">// ***</span>

	<span class="c">// write any verifier log into the temporary directory</span>
	<span class="k">var</span> <span class="n">ve</span> <span class="o">*</span><span class="n">ebpf</span><span class="o">.</span><span class="n">VerifierError</span>
	<span class="k">if</span> <span class="n">errors</span><span class="o">.</span><span class="n">As</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ve</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">p</span> <span class="o">:=</span> <span class="n">path</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">tmpDir</span><span class="p">,</span> <span class="s">"verifier.log"</span><span class="p">)</span>
		<span class="n">f</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">Create</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
		<span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Fprintf</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="s">"%+v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">ve</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">e</span><span class="o">.</span><span class="n">updateRealizedState</span><span class="p">(</span><span class="n">stats</span><span class="p">,</span> <span class="n">origDir</span><span class="p">,</span> <span class="n">revision</span><span class="p">,</span> <span class="n">stateDirComplete</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The core call stack of <code class="language-plaintext highlighter-rouge">regenerateBPF</code> is shown below. It has two main steps:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">runPreCompilationSteps</code> runs everything this regeneration needs before the BPF program is compiled; <strong>the key part is writing the header file</strong></li>
  <li><code class="language-plaintext highlighter-rouge">realizeBPFState</code> compiles and installs the eBPF program for the Endpoint, calling a different Loader method depending on the <a href="https://github.com/cilium/cilium/blob/29211d8d1742d4c7fcabe2a79dddc521f30e2ffb/pkg/endpoint/regeneration/regeneration_context.go#L14">regeneration level</a></li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|- Endpoint.regenerateBPF                      @ pkg/endpoint/bpf.go
   |- Endpoint.runPreCompilationSteps
      |- writeHeaderfile
          |- writeInformationalComments
          |- WriteEndpointConfig               @ pkg/datapath/linux/config/config.go
             |- writeIncludes
             |- writeStaticData
             |- writeTemplateConfig
   |- Endpoint.realizeBPFState
      |- Loader.CompileAndLoad   # if          @ pkg/datapath/loader/loader.go
       - Loader.CompileOrLoad    # elif
       - Loader.ReloadDatapath   # else
</code></pre></div></div>
<h5 id="compileandload">CompileAndLoad</h5>
<p>The following takes the <code class="language-plaintext highlighter-rouge">Loader.CompileAndLoad</code> method as the example and walks through what it does:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/datapath/loader/loader.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">Loader</span><span class="p">)</span> <span class="n">CompileAndLoad</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ep</span> <span class="n">datapath</span><span class="o">.</span><span class="n">Endpoint</span><span class="p">,</span> <span class="n">stats</span> <span class="o">*</span><span class="n">metrics</span><span class="o">.</span><span class="n">SpanStat</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="n">dirs</span> <span class="o">:=</span> <span class="n">directoryInfo</span><span class="p">{</span>
		<span class="n">Library</span><span class="o">:</span> <span class="n">option</span><span class="o">.</span><span class="n">Config</span><span class="o">.</span><span class="n">BpfDir</span><span class="p">,</span>     <span class="c">// /var/lib/cilium/bpf, holds the BPF template files</span>
		<span class="n">Runtime</span><span class="o">:</span> <span class="n">option</span><span class="o">.</span><span class="n">Config</span><span class="o">.</span><span class="n">StateDir</span><span class="p">,</span>
		<span class="n">State</span><span class="o">:</span>   <span class="n">ep</span><span class="o">.</span><span class="n">StateDir</span><span class="p">(),</span>
		<span class="n">Output</span><span class="o">:</span>  <span class="n">ep</span><span class="o">.</span><span class="n">StateDir</span><span class="p">(),</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">l</span><span class="o">.</span><span class="n">compileAndLoad</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ep</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">dirs</span><span class="p">,</span> <span class="n">stats</span><span class="p">)</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">Loader</span><span class="p">)</span> <span class="n">compileAndLoad</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ep</span> <span class="n">datapath</span><span class="o">.</span><span class="n">Endpoint</span><span class="p">,</span> <span class="n">dirs</span> <span class="o">*</span><span class="n">directoryInfo</span><span class="p">,</span> <span class="n">stats</span> <span class="o">*</span><span class="n">metrics</span><span class="o">.</span><span class="n">SpanStat</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="n">err</span> <span class="o">:=</span> <span class="n">compileDatapath</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">ep</span><span class="o">.</span><span class="n">IsHost</span><span class="p">(),</span> <span class="n">ep</span><span class="o">.</span><span class="n">Logger</span><span class="p">(</span><span class="n">Subsystem</span><span class="p">))</span>  <span class="c">// step 1</span>
	<span class="n">err</span> <span class="o">=</span> <span class="n">l</span><span class="o">.</span><span class="n">reloadDatapath</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ep</span><span class="p">,</span> <span class="n">dirs</span><span class="p">)</span>  <span class="c">// step 2</span>
	<span class="k">return</span> <span class="n">err</span>
<span class="p">}</span>
</code></pre></div></div>
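<p>It first calls the <code class="language-plaintext highlighter-rouge">compileDatapath</code> function, which invokes the compiler and the linker for the BPF datapath to produce all the state files; <strong>the final compilation target of these files is the ELF binary format</strong>. Compilation itself takes two program invocations: clang first emits LLVM bitcode, and llc then compiles that into BPF bytecode.</p>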

<p>The source file being compiled is <code class="language-plaintext highlighter-rouge">bpf_lxc.c</code> (see <code class="language-plaintext highlighter-rouge">{cilium}/bpf/bpf_lxc.c</code>), and the output is stored under <code class="language-plaintext highlighter-rouge">/var/run/cilium/state/${ID}</code>.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">compileDatapath</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">dirs</span> <span class="o">*</span><span class="n">directoryInfo</span><span class="p">,</span> <span class="n">isHost</span> <span class="kt">bool</span><span class="p">,</span> <span class="n">logger</span> <span class="o">*</span><span class="n">logrus</span><span class="o">.</span><span class="n">Entry</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>

	<span class="n">versionCmd</span> <span class="o">:=</span> <span class="n">exec</span><span class="o">.</span><span class="n">CommandContext</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">compiler</span><span class="p">,</span> <span class="s">"--version"</span><span class="p">)</span>
	<span class="n">compilerVersion</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">versionCmd</span><span class="o">.</span><span class="n">CombinedOutput</span><span class="p">(</span><span class="n">scopedLog</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span>  <span class="c">// check that the compiler is usable</span>

	<span class="n">versionCmd</span> <span class="o">=</span> <span class="n">exec</span><span class="o">.</span><span class="n">CommandContext</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">linker</span><span class="p">,</span> <span class="s">"--version"</span><span class="p">)</span>
	<span class="n">linkerVersion</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">versionCmd</span><span class="o">.</span><span class="n">CombinedOutput</span><span class="p">(</span><span class="n">scopedLog</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span>  <span class="c">// check that the linker is usable</span>

	<span class="c">// compile the new program</span>
	<span class="n">prog</span> <span class="o">:=</span> <span class="n">epProg</span>  <span class="c">// =&gt; struct epProg = {Source: "bpf_lxc.c", Output: "bpf_lxc.o", OutputType: "obj"}</span>
	<span class="n">compile</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">prog</span><span class="p">,</span> <span class="n">dirs</span><span class="p">)</span>

	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">compile</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">prog</span> <span class="o">*</span><span class="n">progInfo</span><span class="p">,</span> <span class="n">dir</span> <span class="o">*</span><span class="n">directoryInfo</span><span class="p">)</span> <span class="p">(</span><span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">args</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="kt">string</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">16</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">prog</span><span class="o">.</span><span class="n">OutputType</span> <span class="o">==</span> <span class="n">outputSource</span> <span class="p">{</span>
		<span class="n">args</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="s">"-E"</span><span class="p">)</span> <span class="c">// Preprocessor</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="n">args</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="s">"-emit-llvm"</span><span class="p">)</span>
		<span class="n">args</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="s">"-g"</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="c">// append the various compiler flags</span>
	<span class="n">args</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">standardCFlags</span><span class="o">...</span><span class="p">)</span>
	<span class="n">args</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">prog</span><span class="o">.</span><span class="n">Options</span><span class="o">...</span><span class="p">)</span>
	<span class="n">args</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">progCFlags</span><span class="p">(</span><span class="n">prog</span><span class="p">,</span> <span class="n">dir</span><span class="p">)</span><span class="o">...</span><span class="p">)</span>

	<span class="k">switch</span> <span class="n">prog</span><span class="o">.</span><span class="n">OutputType</span> <span class="p">{</span>
	<span class="k">case</span> <span class="n">outputSource</span><span class="o">:</span>
		<span class="n">compileCmd</span> <span class="o">:=</span> <span class="n">exec</span><span class="o">.</span><span class="n">CommandContext</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">compiler</span><span class="p">,</span> <span class="n">args</span><span class="o">...</span><span class="p">)</span>
		<span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">compileCmd</span><span class="o">.</span><span class="n">CombinedOutput</span><span class="p">(</span><span class="n">log</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span>
	<span class="k">case</span> <span class="n">outputObject</span><span class="p">,</span> <span class="n">outputAssembly</span><span class="o">:</span>
		<span class="n">err</span> <span class="o">=</span> <span class="n">compileAndLink</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">prog</span><span class="p">,</span> <span class="n">dir</span><span class="p">,</span> <span class="n">args</span><span class="o">...</span><span class="p">)</span>  <span class="c">// compile and link</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">err</span>
<span class="p">}</span>
</code></pre></div></div>
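<p>Next, the <code class="language-plaintext highlighter-rouge">reloadDatapath</code> method reloads the BPF program; its core call stack is shown below. Its main job is to load the BPF program onto the network interface associated with the Endpoint. The program is attached through the Linux kernel's <code class="language-plaintext highlighter-rouge">tc</code> (traffic control) subsystem.</p>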
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|- Loader.reloadDatapath   @ pkg/datapath/loader/loader.go
   |- replaceDatapath      @ pkg/datapath/loader/netlink.go
      |- attachProgram
         |- replaceQdisc
</code></pre></div></div>
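<h2 id="删除网络">Deleting the Network</h2>
<p>Compared with the CNI ADD action, CNI DEL is considerably simpler: it removes the Endpoint, the IP and the network interface that were created during CNI ADD. Because it works in much the same way as CNI ADD, this section does not go into further detail.</p>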
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">cmdDel</span><span class="p">(</span><span class="n">args</span> <span class="o">*</span><span class="n">skel</span><span class="o">.</span><span class="n">CmdArgs</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="n">n</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">types</span><span class="o">.</span><span class="n">LoadNetConf</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">StdinData</span><span class="p">)</span>

	<span class="n">cniArgs</span> <span class="o">:=</span> <span class="n">types</span><span class="o">.</span><span class="n">ArgsSpec</span><span class="p">{}</span>
	<span class="n">cniTypes</span><span class="o">.</span><span class="n">LoadArgs</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">Args</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">cniArgs</span><span class="p">)</span>  <span class="c">// extract the CNI arguments</span>

	<span class="n">c</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">lib</span><span class="o">.</span><span class="n">NewDeletionFallbackClient</span><span class="p">(</span><span class="n">logger</span><span class="p">)</span>  <span class="c">// initialize the client</span>

	<span class="n">id</span> <span class="o">:=</span> <span class="n">endpointid</span><span class="o">.</span><span class="n">NewID</span><span class="p">(</span><span class="n">endpointid</span><span class="o">.</span><span class="n">ContainerIdPrefix</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">ContainerID</span><span class="p">)</span>  <span class="c">// Prefix: "container-id"</span>
	<span class="n">c</span><span class="o">.</span><span class="n">EndpointDelete</span><span class="p">(</span><span class="n">id</span><span class="p">)</span>  <span class="c">// delete the Endpoint</span>

	<span class="k">if</span> <span class="n">n</span><span class="o">.</span><span class="n">IPAM</span><span class="o">.</span><span class="n">Type</span> <span class="o">!=</span> <span class="s">""</span> <span class="p">{</span>
		<span class="n">err</span> <span class="o">=</span> <span class="n">cniInvoke</span><span class="o">.</span><span class="n">DelegateDel</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">TODO</span><span class="p">(),</span> <span class="n">n</span><span class="o">.</span><span class="n">IPAM</span><span class="o">.</span><span class="n">Type</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">StdinData</span><span class="p">,</span> <span class="no">nil</span><span class="p">)</span>  <span class="c">// release the IP</span>
	<span class="p">}</span>

	<span class="n">netNs</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">ns</span><span class="o">.</span><span class="n">GetNS</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">Netns</span><span class="p">)</span>
	<span class="k">defer</span> <span class="n">netNs</span><span class="o">.</span><span class="n">Close</span><span class="p">()</span>
	<span class="n">err</span> <span class="o">=</span> <span class="n">netns</span><span class="o">.</span><span class="n">RemoveIfFromNetNSIfExists</span><span class="p">(</span><span class="n">netNs</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">IfName</span><span class="p">)</span>  <span class="c">// remove the interface from the network namespace</span>

	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
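<h2 id="总结">Summary</h2>
<p>This post gave a brief analysis of the main capabilities of cilium-cni, which by itself contains little that is hard to understand. By contrast, cilium-daemon, the source of those CNI capabilities, is far more complex in design. The post often only scratches the surface of cilium-daemon, especially in the sections on "Endpoint creation". Because of space constraints and my own limited understanding, many questions were not explored in depth, for example:</p>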

<ul>
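  <li>Which networking capabilities does the BPF program loaded by cilium-cni provide? In other words, how does the networking implemented in <code class="language-plaintext highlighter-rouge">bpf_lxc.c</code> work?</li>
  <li><del>How does loading the BPF program onto a network interface work together with tc? Which operations are involved?</del> See the analysis in the <a href="https://shawnh2.github.io/post/2023/08/09/cilium-tc-reload-datapath.html">tc ReloadDatapath post</a></li>
  <li>When an Endpoint's Security identity changes, how does its Network Policy change, and how is it computed?</li>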
</ul>

<h2 id="reference">Reference</h2>

<ol>
  <li><a href="https://docs.cilium.io/en/stable/network/concepts/ipam/kubernetes/">https://docs.cilium.io/en/stable/network/concepts/ipam/kubernetes/</a></li>
  <li><a href="https://docs.cilium.io/en/stable/network/concepts/ipam/deep_dive/">https://docs.cilium.io/en/stable/network/concepts/ipam/deep_dive/</a></li>
  <li><a href="https://docs.cilium.io/en/stable/internals/security-identities/">https://docs.cilium.io/en/stable/internals/security-identities/</a></li>
  <li><a href="http://arthurchiao.art/blog/cilium-code-cni-create-network/">http://arthurchiao.art/blog/cilium-code-cni-create-network/</a></li>
  <li><a href="https://www.cni.dev/docs/spec/#section-4-plugin-delegation">https://www.cni.dev/docs/spec/#section-4-plugin-delegation</a></li>
</ol>]]></content><author><name>Your Name</name><email>shawnhxh@outlook.com</email></author><category term="post" /><category term="Network" /><category term="CNI" /><category term="Cilium" /><summary type="html"><![CDATA[The code in this post is based on Cilium HEAD 4093531 and focuses on the operations of the Cilium CNI. Adding the network: Cilium CNI defines its handling of the ADD operation in plugins/cilium-cni/main.go, in the cmdAdd function, which is mainly responsible for creating the network for a Pod; its overall control sequence is shown in the figure below. In the IP address allocation stage the figure depicts three IPAM modes (host-scope, crd and eni); this post covers only host-scope, the default mode, i.e. the part of the flow marked with a red background. Because cmdAdd is long, the following sections analyze its important parts piece by piece.]]></summary></entry><entry><title type="html">The Garbage Collection of Pods</title><link href="https://shawnh2.github.io/post/2023/07/10/pod-gc.html" rel="alternate" type="text/html" title="The Garbage Collection of Pods" /><published>2023-07-10T00:00:00+08:00</published><updated>2023-07-10T00:00:00+08:00</updated><id>https://shawnh2.github.io/post/2023/07/10/pod-gc</id><content type="html" xml:base="https://shawnh2.github.io/post/2023/07/10/pod-gc.html"><![CDATA[<blockquote>
  <p>The code in this post is based on <a href="https://github.com/kubernetes/kubernetes/tree/release-1.27">Kubernetes v1.27</a>.</p>
</blockquote>

<p>In K8s, the API objects of Pods that failed to run or to be scheduled remain in the cluster, so cleaning them up promptly is important to prevent resource leaks. K8s has a controller named Pod GC dedicated to reclaiming such objects: once the number of terminated Pods exceeds the <code class="language-plaintext highlighter-rouge">terminated-pod-gc-threshold</code> set on kube-controller-manager, Pod GC starts cleaning up; see <code class="language-plaintext highlighter-rouge">gcTerminated</code>.</p>

<p>In addition, Pod GC cleans up any Pods that meet one of the following conditions:</p>

<ul>
  <li>Orphan Pods, i.e. Pods bound to a Node that no longer exists; see <code class="language-plaintext highlighter-rouge">gcOrphaned</code></li>
  <li>Pods that are terminating without ever having been scheduled; see <code class="language-plaintext highlighter-rouge">gcUnscheduledTerminating</code></li>
  <li>Terminating Pods bound to a Node that is not Ready and carries the <code class="language-plaintext highlighter-rouge">node.kubernetes.io/out-of-service</code> taint; see <code class="language-plaintext highlighter-rouge">gcTerminating</code> (only when the <code class="language-plaintext highlighter-rouge">NodeOutOfServiceVolumeDetach</code> feature is enabled)</li>
</ul>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/controller/podgc/gc_controller.go</span>

<span class="c">// the method the Pod GC controller ultimately runs</span>
<span class="k">func</span> <span class="p">(</span><span class="n">gcc</span> <span class="o">*</span><span class="n">PodGCController</span><span class="p">)</span> <span class="n">gc</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="p">{</span>
	<span class="c">// list all pods and nodes currently in the cluster</span>
	<span class="n">pods</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">gcc</span><span class="o">.</span><span class="n">podLister</span><span class="o">.</span><span class="n">List</span><span class="p">(</span><span class="n">labels</span><span class="o">.</span><span class="n">Everything</span><span class="p">())</span>
	<span class="n">nodes</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">gcc</span><span class="o">.</span><span class="n">nodeLister</span><span class="o">.</span><span class="n">List</span><span class="p">(</span><span class="n">labels</span><span class="o">.</span><span class="n">Everything</span><span class="p">())</span>

	<span class="k">if</span> <span class="n">gcc</span><span class="o">.</span><span class="n">terminatedPodThreshold</span> <span class="o">&gt;</span> <span class="m">0</span> <span class="p">{</span> <span class="c">// a threshold &lt;= 0 disables GC of terminated pods; only the other passes run</span>
		<span class="n">gcc</span><span class="o">.</span><span class="n">gcTerminated</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">pods</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="k">if</span> <span class="n">utilfeature</span><span class="o">.</span><span class="n">DefaultFeatureGate</span><span class="o">.</span><span class="n">Enabled</span><span class="p">(</span><span class="n">features</span><span class="o">.</span><span class="n">NodeOutOfServiceVolumeDetach</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">gcc</span><span class="o">.</span><span class="n">gcTerminating</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">pods</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="n">gcc</span><span class="o">.</span><span class="n">gcOrphaned</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">pods</span><span class="p">,</span> <span class="n">nodes</span><span class="p">)</span>
	<span class="n">gcc</span><span class="o">.</span><span class="n">gcUnscheduledTerminating</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">pods</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<!--more-->

<h2 id="回收过程">Collection Process</h2>
<h3 id="gcterminated">gcTerminated</h3>
<p>For the regular Pod collection pass, the point to focus on is <strong>how a Pod is defined as terminated</strong>. In Pod GC, a terminated Pod is described as <strong>a Pod in the Succeeded or Failed phase</strong>.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">isPodTerminated</span><span class="p">(</span><span class="n">pod</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">)</span> <span class="kt">bool</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">phase</span> <span class="o">:=</span> <span class="n">pod</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">Phase</span><span class="p">;</span> <span class="n">phase</span> <span class="o">!=</span> <span class="n">v1</span><span class="o">.</span><span class="n">PodPending</span> <span class="o">&amp;&amp;</span> <span class="n">phase</span> <span class="o">!=</span> <span class="n">v1</span><span class="o">.</span><span class="n">PodRunning</span> <span class="o">&amp;&amp;</span> <span class="n">phase</span> <span class="o">!=</span> <span class="n">v1</span><span class="o">.</span><span class="n">PodUnknown</span> <span class="p">{</span>
		<span class="k">return</span> <span class="no">true</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="no">false</span>
<span class="p">}</span>
</code></pre></div></div>
<p>When these Pod objects are deleted, each deletion runs in its own goroutine:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">gcc</span> <span class="o">*</span><span class="n">PodGCController</span><span class="p">)</span> <span class="n">gcTerminated</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">pods</span> <span class="p">[]</span><span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">terminatedPods</span> <span class="o">:=</span> <span class="p">[]</span><span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">{}</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">pod</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">pods</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">isPodTerminated</span><span class="p">(</span><span class="n">pod</span><span class="p">)</span> <span class="p">{</span>
			<span class="n">terminatedPods</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">terminatedPods</span><span class="p">,</span> <span class="n">pod</span><span class="p">)</span>  <span class="c">// collect all pods in the terminated state</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="n">terminatedPodCount</span> <span class="o">:=</span> <span class="nb">len</span><span class="p">(</span><span class="n">terminatedPods</span><span class="p">)</span>
	<span class="n">deleteCount</span> <span class="o">:=</span> <span class="n">terminatedPodCount</span> <span class="o">-</span> <span class="n">gcc</span><span class="o">.</span><span class="n">terminatedPodThreshold</span>
	<span class="k">if</span> <span class="n">deleteCount</span> <span class="o">&lt;=</span> <span class="m">0</span> <span class="p">{</span>  <span class="c">// not above the GC threshold: skip this round</span>
		<span class="k">return</span>
	<span class="p">}</span>

	<span class="n">sort</span><span class="o">.</span><span class="n">Sort</span><span class="p">(</span><span class="n">byEvictionAndCreationTimestamp</span><span class="p">(</span><span class="n">terminatedPods</span><span class="p">))</span>  <span class="c">// sort by eviction status and pod creation timestamp</span>
	<span class="k">var</span> <span class="n">wait</span> <span class="n">sync</span><span class="o">.</span><span class="n">WaitGroup</span>
	<span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">deleteCount</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span> <span class="p">{</span>
		<span class="n">wait</span><span class="o">.</span><span class="n">Add</span><span class="p">(</span><span class="m">1</span><span class="p">)</span>
		<span class="k">go</span> <span class="k">func</span><span class="p">(</span><span class="n">pod</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">)</span> <span class="p">{</span>
			<span class="k">defer</span> <span class="n">wait</span><span class="o">.</span><span class="n">Done</span><span class="p">()</span>
			<span class="n">gcc</span><span class="o">.</span><span class="n">markFailedAndDeletePod</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">pod</span><span class="p">)</span>  <span class="c">// perform the deletion</span>
		<span class="p">}(</span><span class="n">terminatedPods</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
	<span class="p">}</span>
	<span class="n">wait</span><span class="o">.</span><span class="n">Wait</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>
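<p>The threshold check in <code class="language-plaintext highlighter-rouge">gcTerminated</code> above can be tried in isolation. The following is a minimal sketch, not the controller's code: the <code class="language-plaintext highlighter-rouge">pod</code> struct and <code class="language-plaintext highlighter-rouge">pickForDeletion</code> are hypothetical names, creation time is a plain integer, and the ordering uses creation time only, whereas the real <code class="language-plaintext highlighter-rouge">byEvictionAndCreationTimestamp</code> comparator also looks at the eviction status.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, simplified stand-in for v1.Pod.
type pod struct {
	Name        string
	CreatedUnix int64 // creation time in seconds (assumption for this sketch)
}

// pickForDeletion returns the oldest terminated pods that exceed the threshold.
// It returns nil when the number of terminated pods does not exceed the threshold.
func pickForDeletion(terminated []pod, threshold int) []pod {
	deleteCount := len(terminated) - threshold
	if deleteCount <= 0 {
		return nil
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]pod(nil), terminated...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].CreatedUnix < sorted[j].CreatedUnix
	})
	return sorted[:deleteCount]
}

func main() {
	pods := []pod{{"c", 30}, {"a", 10}, {"b", 20}}
	// With a threshold of 1, the two oldest pods are selected.
	for _, p := range pickForDeletion(pods, 1) {
		fmt.Println("delete", p.Name)
	}
}
```

<p>Only the surplus above the threshold is removed, so the newest terminated Pods stay around for inspection.</p>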
<h3 id="gcorphaned">gcOrphaned</h3>
<p>Orphaned Pods are detected by checking whether the <code class="language-plaintext highlighter-rouge">NodeName</code> in the Pod spec is set and, if it is, whether it matches the name of a known Node. For Pods with an unknown <code class="language-plaintext highlighter-rouge">NodeName</code>, <strong>Pod GC does not immediately treat them as orphans</strong>. Instead, it waits for a quarantine period, <code class="language-plaintext highlighter-rouge">quarantineTime</code> (40s), and then checks again whether that <code class="language-plaintext highlighter-rouge">NodeName</code> refers to an existing Node. Only if it <strong>still does not</strong> are these Pods considered orphans and deleted.</p>

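<p>Pod GC introduces the quarantine period <strong>to cover the case where the Node is not actually gone but has simply not reached the Ready state yet</strong>, so that Pods are not deleted by mistake before the Node becomes Ready.</p>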
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">gcc</span> <span class="o">*</span><span class="n">PodGCController</span><span class="p">)</span> <span class="n">gcOrphaned</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">pods</span> <span class="p">[]</span><span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">,</span> <span class="n">nodes</span> <span class="p">[]</span><span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Node</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">existingNodeNames</span> <span class="o">:=</span> <span class="n">sets</span><span class="o">.</span><span class="n">NewString</span><span class="p">()</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">node</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">nodes</span> <span class="p">{</span>
		<span class="n">existingNodeNames</span><span class="o">.</span><span class="n">Insert</span><span class="p">(</span><span class="n">node</span><span class="o">.</span><span class="n">Name</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="c">// quarantine newly discovered, unknown nodes</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">pod</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">pods</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">pod</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">NodeName</span> <span class="o">!=</span> <span class="s">""</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">existingNodeNames</span><span class="o">.</span><span class="n">Has</span><span class="p">(</span><span class="n">pod</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">NodeName</span><span class="p">)</span> <span class="p">{</span>
			<span class="n">gcc</span><span class="o">.</span><span class="n">nodeQueue</span><span class="o">.</span><span class="n">AddAfter</span><span class="p">(</span><span class="n">pod</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">NodeName</span><span class="p">,</span> <span class="n">gcc</span><span class="o">.</span><span class="n">quarantineTime</span><span class="p">)</span> <span class="c">// enqueue the node name only after the quarantineTime period has elapsed</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="c">// check whether the node is still unknown after the quarantine period</span>
	<span class="n">deletedNodesNames</span><span class="p">,</span> <span class="n">quit</span> <span class="o">:=</span> <span class="n">gcc</span><span class="o">.</span><span class="n">discoverDeletedNodes</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">existingNodeNames</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">quit</span> <span class="p">{</span>
		<span class="k">return</span>
	<span class="p">}</span>

	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">pod</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">pods</span> <span class="p">{</span>
		<span class="k">if</span> <span class="o">!</span><span class="n">deletedNodesNames</span><span class="o">.</span><span class="n">Has</span><span class="p">(</span><span class="n">pod</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">NodeName</span><span class="p">)</span> <span class="p">{</span>  <span class="c">// skip pods whose node is not confirmed deleted; only the remaining pods are removed</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="n">condition</span> <span class="o">:=</span> <span class="n">corev1apply</span><span class="o">.</span><span class="n">PodCondition</span><span class="p">()</span><span class="o">.</span>
			<span class="n">WithType</span><span class="p">(</span><span class="n">v1</span><span class="o">.</span><span class="n">DisruptionTarget</span><span class="p">)</span><span class="o">.</span>
			<span class="n">WithStatus</span><span class="p">(</span><span class="n">v1</span><span class="o">.</span><span class="n">ConditionTrue</span><span class="p">)</span><span class="o">.</span>
			<span class="n">WithReason</span><span class="p">(</span><span class="s">"DeletionByPodGC"</span><span class="p">)</span><span class="o">.</span>
			<span class="n">WithMessage</span><span class="p">(</span><span class="s">"PodGC: node no longer exists"</span><span class="p">)</span><span class="o">.</span>
			<span class="n">WithLastTransitionTime</span><span class="p">(</span><span class="n">metav1</span><span class="o">.</span><span class="n">Now</span><span class="p">())</span>
		<span class="n">gcc</span><span class="o">.</span><span class="n">markFailedAndDeletePodWithCondition</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">pod</span><span class="p">,</span> <span class="n">condition</span><span class="p">)</span>  <span class="c">// perform the deletion</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Once the quarantine period ends, if the <code class="language-plaintext highlighter-rouge">NodeName</code> still does not belong to any Node, the Pods bound to that Node become candidates for deletion:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">gcc</span> <span class="o">*</span><span class="n">PodGCController</span><span class="p">)</span> <span class="n">discoverDeletedNodes</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">existingNodeNames</span> <span class="n">sets</span><span class="o">.</span><span class="n">String</span><span class="p">)</span> <span class="p">(</span><span class="n">sets</span><span class="o">.</span><span class="n">String</span><span class="p">,</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">deletedNodesNames</span> <span class="o">:=</span> <span class="n">sets</span><span class="o">.</span><span class="n">NewString</span><span class="p">()</span>
	<span class="k">for</span> <span class="n">gcc</span><span class="o">.</span><span class="n">nodeQueue</span><span class="o">.</span><span class="n">Len</span><span class="p">()</span> <span class="o">&gt;</span> <span class="m">0</span> <span class="p">{</span>
		<span class="n">item</span><span class="p">,</span> <span class="n">quit</span> <span class="o">:=</span> <span class="n">gcc</span><span class="o">.</span><span class="n">nodeQueue</span><span class="o">.</span><span class="n">Get</span><span class="p">()</span>
		<span class="k">if</span> <span class="n">quit</span> <span class="p">{</span>
			<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="no">true</span>  <span class="c">// quit</span>
		<span class="p">}</span>
		<span class="n">nodeName</span> <span class="o">:=</span> <span class="n">item</span><span class="o">.</span><span class="p">(</span><span class="kt">string</span><span class="p">)</span>
		<span class="k">if</span> <span class="o">!</span><span class="n">existingNodeNames</span><span class="o">.</span><span class="n">Has</span><span class="p">(</span><span class="n">nodeName</span><span class="p">)</span> <span class="p">{</span>  <span class="c">// the node is still unknown</span>
			<span class="n">exists</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">gcc</span><span class="o">.</span><span class="n">checkIfNodeExists</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">nodeName</span><span class="p">)</span> <span class="c">// check through kube-client whether the node really exists</span>
			<span class="k">switch</span> <span class="p">{</span>
			<span class="k">case</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span><span class="o">:</span>
				<span class="c">// ...</span>
			<span class="k">case</span> <span class="o">!</span><span class="n">exists</span><span class="o">:</span>
				<span class="c">// a node that does not exist is added to the deletion set</span>
				<span class="n">deletedNodesNames</span><span class="o">.</span><span class="n">Insert</span><span class="p">(</span><span class="n">nodeName</span><span class="p">)</span>
			<span class="p">}</span>
		<span class="p">}</span>
		<span class="n">gcc</span><span class="o">.</span><span class="n">nodeQueue</span><span class="o">.</span><span class="n">Done</span><span class="p">(</span><span class="n">item</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">deletedNodesNames</span><span class="p">,</span> <span class="no">false</span>
<span class="p">}</span>
</code></pre></div></div>
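<p>The quarantine idea can be sketched without the workqueue machinery. This is a simplified illustration under stated assumptions: the <code class="language-plaintext highlighter-rouge">quarantine</code> type and its methods are hypothetical, time is a plain count of seconds passed in by the caller, and the <code class="language-plaintext highlighter-rouge">exists</code> callback stands in for the API lookup that <code class="language-plaintext highlighter-rouge">checkIfNodeExists</code> performs. The real controller relies on a delaying queue (<code class="language-plaintext highlighter-rouge">AddAfter</code>) instead of a map.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// quarantine is a hypothetical stand-in for the delaying node queue used by Pod GC.
// Times are plain seconds so the sketch stays deterministic.
type quarantine struct {
	period    int64            // quarantine length in seconds
	firstSeen map[string]int64 // node name -> time it was first seen missing
}

func newQuarantine(period int64) *quarantine {
	return &quarantine{period: period, firstSeen: map[string]int64{}}
}

// observe records that a pod references nodeName, which is absent from the known set.
// Repeated observations do not restart the quarantine.
func (q *quarantine) observe(nodeName string, now int64) {
	if _, ok := q.firstSeen[nodeName]; !ok {
		q.firstSeen[nodeName] = now
	}
}

// expired returns, in sorted order, the nodes whose quarantine has elapsed and that
// exists still reports as missing. Every node whose quarantine elapsed is forgotten.
func (q *quarantine) expired(now int64, exists func(string) bool) []string {
	var deleted []string
	for name, seen := range q.firstSeen {
		if now-seen < q.period {
			continue // still quarantined, decide later
		}
		if !exists(name) {
			deleted = append(deleted, name)
		}
		delete(q.firstSeen, name)
	}
	sort.Strings(deleted)
	return deleted
}

func main() {
	q := newQuarantine(40)
	present := map[string]bool{}
	exists := func(name string) bool { return present[name] }

	q.observe("node-a", 0)
	q.observe("node-b", 0)
	fmt.Println(q.expired(10, exists)) // inside the quarantine window: nothing reported

	present["node-b"] = true           // node-b shows up in the meantime
	fmt.Println(q.expired(40, exists)) // only node-a is reported as deleted
}
```

<p>A node that appears during the window is never reported, which is exactly the false positive the quarantine period is meant to avoid.</p>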
<h3 id="gcunscheduledterminating">gcUnscheduledTerminating</h3>
<p>This case is simple to handle: Pods that are terminating but have never been scheduled to any node can be identified directly:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">gcc</span> <span class="o">*</span><span class="n">PodGCController</span><span class="p">)</span> <span class="n">gcUnscheduledTerminating</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">pods</span> <span class="p">[]</span><span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">pod</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">pods</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">pod</span><span class="o">.</span><span class="n">DeletionTimestamp</span> <span class="o">==</span> <span class="no">nil</span> <span class="o">||</span> <span class="nb">len</span><span class="p">(</span><span class="n">pod</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">NodeName</span><span class="p">)</span> <span class="o">&gt;</span> <span class="m">0</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="n">gcc</span><span class="o">.</span><span class="n">markFailedAndDeletePod</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">pod</span><span class="p">)</span>  <span class="c">// perform the deletion</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="gcterminating">gcTerminating</h3>
<p>This feature was introduced by <a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown">KEP-2268</a> and <strong>mainly targets stateful workloads</strong>. It lets such workloads fail over to a different Node when the original Node is shut down or enters an unrecoverable state (for example, a hardware or OS failure).</p>

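<p>Before this feature, if a Node shutdown was not detected by the kubelet's Node Shutdown Manager, <strong>the kubelet on the shut-down Node could not delete its Pods</strong>, so the StatefulSet could not create new Pods with the same names. If those Pods had volumes mounted, the associated volumes were not released from the original Node either, so the Pods could not be bound to a new Node. As long as the shut-down Node was not recovered, these <strong>Pods stayed stuck in the terminating state forever</strong>, because only after the Node came back would the kubelet delete them so that they could be recreated on other Nodes.</p>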
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">gcc</span> <span class="o">*</span><span class="n">PodGCController</span><span class="p">)</span> <span class="n">gcTerminating</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">pods</span> <span class="p">[]</span><span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">terminatingPods</span> <span class="o">:=</span> <span class="p">[]</span><span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">{}</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">pod</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">pods</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">isPodTerminating</span><span class="p">(</span><span class="n">pod</span><span class="p">)</span> <span class="p">{</span>  <span class="c">// =&gt; pod.ObjectMeta.DeletionTimestamp != nil</span>
			<span class="n">node</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">gcc</span><span class="o">.</span><span class="n">nodeLister</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="n">pod</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">NodeName</span><span class="p">)</span>

			<span class="c">// a pod is added to terminatingPods only when both conditions hold:</span>
			<span class="c">// 1. the Node is not ready</span>
			<span class="c">// 2. the Node carries the `node.kubernetes.io/out-of-service` taint</span>
			<span class="k">if</span> <span class="o">!</span><span class="n">nodeutil</span><span class="o">.</span><span class="n">IsNodeReady</span><span class="p">(</span><span class="n">node</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="n">taints</span><span class="o">.</span><span class="n">TaintKeyExists</span><span class="p">(</span><span class="n">node</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">Taints</span><span class="p">,</span> <span class="n">v1</span><span class="o">.</span><span class="n">TaintNodeOutOfService</span><span class="p">)</span> <span class="p">{</span>
				<span class="n">terminatingPods</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">terminatingPods</span><span class="p">,</span> <span class="n">pod</span><span class="p">)</span>
			<span class="p">}</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="n">deleteCount</span> <span class="o">:=</span> <span class="nb">len</span><span class="p">(</span><span class="n">terminatingPods</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">deleteCount</span> <span class="o">==</span> <span class="m">0</span> <span class="p">{</span>
		<span class="k">return</span>
	<span class="p">}</span>

	<span class="n">sort</span><span class="o">.</span><span class="n">Sort</span><span class="p">(</span><span class="n">byEvictionAndCreationTimestamp</span><span class="p">(</span><span class="n">terminatingPods</span><span class="p">))</span>  <span class="c">// sort by eviction status and pod creation timestamp</span>
	<span class="k">var</span> <span class="n">wait</span> <span class="n">sync</span><span class="o">.</span><span class="n">WaitGroup</span>
	<span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">deleteCount</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span> <span class="p">{</span>
		<span class="n">wait</span><span class="o">.</span><span class="n">Add</span><span class="p">(</span><span class="m">1</span><span class="p">)</span>
		<span class="k">go</span> <span class="k">func</span><span class="p">(</span><span class="n">pod</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">)</span> <span class="p">{</span>
			<span class="k">defer</span> <span class="n">wait</span><span class="o">.</span><span class="n">Done</span><span class="p">()</span>
			<span class="n">gcc</span><span class="o">.</span><span class="n">markFailedAndDeletePod</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">pod</span><span class="p">)</span>  <span class="c">// perform the deletion</span>
		<span class="p">}(</span><span class="n">terminatingPods</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
	<span class="p">}</span>
	<span class="n">wait</span><span class="o">.</span><span class="n">Wait</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>
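<p>The feature requires the <strong>user to manually</strong> add a taint named <code class="language-plaintext highlighter-rouge">node.kubernetes.io/out-of-service</code> to Nodes that are known to be shut down (and will not recover any time soon). The taint means that Pods are evicted from the Node, and a Pod without a toleration for this taint will not be created on the shut-down Node again.</p>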
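<p>The eligibility condition used by <code class="language-plaintext highlighter-rouge">gcTerminating</code> boils down to a small predicate. Below is a self-contained sketch with hypothetical <code class="language-plaintext highlighter-rouge">node</code> and <code class="language-plaintext highlighter-rouge">taint</code> types standing in for the Kubernetes API types; the taint value and effect in the example (<code class="language-plaintext highlighter-rouge">nodeshutdown</code>, <code class="language-plaintext highlighter-rouge">NoExecute</code>) are sample values, and only the taint key matters to the check.</p>

```go
package main

import "fmt"

// taintNodeOutOfService is the taint key a user adds to a node that is known to be down.
const taintNodeOutOfService = "node.kubernetes.io/out-of-service"

// taint and node are hypothetical, simplified stand-ins for v1.Taint and v1.Node.
type taint struct {
	Key, Value, Effect string
}

type node struct {
	Ready  bool
	Taints []taint
}

// hasTaintKey reports whether any taint in the list has the given key.
func hasTaintKey(taints []taint, key string) bool {
	for _, t := range taints {
		if t.Key == key {
			return true
		}
	}
	return false
}

// eligibleForForceDelete mirrors the condition shown in gcTerminating:
// the node must be not ready AND carry the out-of-service taint.
func eligibleForForceDelete(n node) bool {
	return !n.Ready && hasTaintKey(n.Taints, taintNodeOutOfService)
}

func main() {
	down := node{
		Ready:  false,
		Taints: []taint{{Key: taintNodeOutOfService, Value: "nodeshutdown", Effect: "NoExecute"}},
	}
	fmt.Println(eligibleForForceDelete(down))               // tainted and not ready
	fmt.Println(eligibleForForceDelete(node{Ready: false})) // not ready but untainted
}
```

<p>Requiring both conditions keeps a healthy Node that was tainted by mistake from losing its terminating Pods.</p>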
<h2 id="删除过程">Deletion process</h2>
<p>Every collection path above ends by calling a delete function, which is ultimately <code class="language-plaintext highlighter-rouge">markFailedAndDeletePodWithCondition</code>. Apart from the <code class="language-plaintext highlighter-rouge">PodDisruptionConditions</code> handling, it simply deletes the Pod through kube-client:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">gcc</span> <span class="o">*</span><span class="n">PodGCController</span><span class="p">)</span> <span class="n">markFailedAndDeletePod</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">pod</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="k">return</span> <span class="n">gcc</span><span class="o">.</span><span class="n">markFailedAndDeletePodWithCondition</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">pod</span><span class="p">,</span> <span class="no">nil</span><span class="p">)</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">gcc</span> <span class="o">*</span><span class="n">PodGCController</span><span class="p">)</span> <span class="n">markFailedAndDeletePodWithCondition</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">pod</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">,</span> <span class="n">condition</span> <span class="o">*</span><span class="n">corev1apply</span><span class="o">.</span><span class="n">PodConditionApplyConfiguration</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">utilfeature</span><span class="o">.</span><span class="n">DefaultFeatureGate</span><span class="o">.</span><span class="n">Enabled</span><span class="p">(</span><span class="n">features</span><span class="o">.</span><span class="n">PodDisruptionConditions</span><span class="p">)</span> <span class="p">{</span>
		<span class="c">// a Pod that has not finished yet (neither Succeeded nor Failed) is first marked as Failed</span>
		<span class="k">if</span> <span class="n">pod</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">Phase</span> <span class="o">!=</span> <span class="n">v1</span><span class="o">.</span><span class="n">PodSucceeded</span> <span class="o">&amp;&amp;</span> <span class="n">pod</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">Phase</span> <span class="o">!=</span> <span class="n">v1</span><span class="o">.</span><span class="n">PodFailed</span> <span class="p">{</span>
			<span class="n">podApply</span> <span class="o">:=</span> <span class="n">corev1apply</span><span class="o">.</span><span class="n">Pod</span><span class="p">(</span><span class="n">pod</span><span class="o">.</span><span class="n">Name</span><span class="p">,</span> <span class="n">pod</span><span class="o">.</span><span class="n">Namespace</span><span class="p">)</span><span class="o">.</span><span class="n">WithStatus</span><span class="p">(</span><span class="n">corev1apply</span><span class="o">.</span><span class="n">PodStatus</span><span class="p">())</span>
			<span class="n">podApply</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">WithPhase</span><span class="p">(</span><span class="n">v1</span><span class="o">.</span><span class="n">PodFailed</span><span class="p">)</span>
			<span class="c">// the condition is non-nil only when called from gcOrphaned, where it carries the reason `DeletionByPodGC`</span>
			<span class="k">if</span> <span class="n">condition</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
				<span class="n">podApply</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">WithConditions</span><span class="p">(</span><span class="n">condition</span><span class="p">)</span>
			<span class="p">}</span>
			<span class="n">gcc</span><span class="o">.</span><span class="n">kubeClient</span><span class="o">.</span><span class="n">CoreV1</span><span class="p">()</span><span class="o">.</span><span class="n">Pods</span><span class="p">(</span><span class="n">pod</span><span class="o">.</span><span class="n">Namespace</span><span class="p">)</span><span class="o">.</span><span class="n">ApplyStatus</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">podApply</span><span class="p">,</span> <span class="n">metav1</span><span class="o">.</span><span class="n">ApplyOptions</span><span class="p">{</span><span class="n">FieldManager</span><span class="o">:</span> <span class="n">fieldManager</span><span class="p">,</span> <span class="n">Force</span><span class="o">:</span> <span class="no">true</span><span class="p">})</span>  <span class="c">// =&gt; fieldManager := "PodGC"</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">gcc</span><span class="o">.</span><span class="n">kubeClient</span><span class="o">.</span><span class="n">CoreV1</span><span class="p">()</span><span class="o">.</span><span class="n">Pods</span><span class="p">(</span><span class="n">pod</span><span class="o">.</span><span class="n">Namespace</span><span class="p">)</span><span class="o">.</span><span class="n">Delete</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">pod</span><span class="o">.</span><span class="n">Name</span><span class="p">,</span> <span class="o">*</span><span class="n">metav1</span><span class="o">.</span><span class="n">NewDeleteOptions</span><span class="p">(</span><span class="m">0</span><span class="p">))</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">PodDisruptionConditions</code> feature was originally introduced by <a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures">KEP-3329</a>. Its main <strong>purpose is to give users a friendlier explanation of why a Pod failed</strong>. It roughly splits Pod disruptions into two kinds: bugs in the container or program itself, and errors at the infrastructure level. For the latter it defines <a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-conditions">a set of disruption conditions caused by infrastructure</a>, and Pod GC is one of them (<code class="language-plaintext highlighter-rouge">DeletionByPodGC</code>).</p>
<h2 id="reference">Reference</h2>

<ol>
  <li><a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/">https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/</a></li>
  <li><a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/">https://kubernetes.io/docs/concepts/workloads/pods/disruptions/</a></li>
  <li><a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown">https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown</a></li>
  <li><a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures">https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures</a></li>
</ol>]]></content><author><name>Your Name</name><email>shawnhxh@outlook.com</email></author><category term="post" /><category term="Kubernetes" /><summary type="html"><![CDATA[本文代码基于 Kubernetes v1.27 展开。 在 K8s 中，对于执行或调度失败的 Pods 来说，它的 API 对象还依然会存在于集群中。及时的清理掉这些对象以防止资源泄露，就变得尤其重要。K8s 中存在一个名为 Pod GC 的 controller 专门负责回收这种对象，在已终止 Pods 的数量达到 kube-controller-manager 设置的terminated-pod-gc-threshold阈值之后，Pod GC 便会开始清理工作，见gcTerminated。 另外，Pod GC 也会清理符合以下条件的任何 Pods： 是孤儿 Pods，即绑定到了一个已经不存在的 Node 上，见gcOrphaned 是未经调度过就终止的 Pods，见gcUnscheduledTerminating 是正在终止的 Pods，并绑定到了一个未 Ready 且带有node.kubernetes.io/out-of-service污点的 Node 上，见gcTerminating（启用NodeOutOfServiceVolumeDetach特性后） // pkg/controller/podgc/gc_controller.go // Pod GC controller 最终使用的方法 func (gcc *PodGCController) gc(ctx context.Context) { // 列举出当前集群中所有 pod 和 node 的资源 pods, err := gcc.podLister.List(labels.Everything()) nodes, err := gcc.nodeLister.List(labels.Everything()) if gcc.terminatedPodThreshold &gt; 0 { // 该阈值小于等于0，说明不启用 Pod GC，只进行一些其他的回收工作 gcc.gcTerminated(ctx, pods) } if utilfeature.DefaultFeatureGate.Enabled(features.NodeOutOfServiceVolumeDetach) { gcc.gcTerminating(ctx, pods) } gcc.gcOrphaned(ctx, pods, nodes) gcc.gcUnscheduledTerminating(ctx, pods) }]]></summary></entry><entry><title type="html">Raft 的成员变更与 Etcd 实现</title><link href="https://shawnh2.github.io/post/2023/06/25/raft-membership-change-and-etcd.html" rel="alternate" type="text/html" title="Raft 的成员变更与 Etcd 实现" /><published>2023-06-25T00:00:00+08:00</published><updated>2023-06-25T00:00:00+08:00</updated><id>https://shawnh2.github.io/post/2023/06/25/raft-membership-change-and-etcd</id><content type="html" xml:base="https://shawnh2.github.io/post/2023/06/25/raft-membership-change-and-etcd.html"><![CDATA[<blockquote>
  <p>This post analyzes membership changes in the Raft protocol alongside their implementation in <a href="https://github.com/etcd-io/etcd/tree/release-3.4">Etcd v3.4</a>.</p>
</blockquote>

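<p>A change in cluster membership is a change in cluster configuration. Raft allows a cluster's configuration to be changed automatically without restarting the cluster.</p>
<h2 id="单成员变更">Single-server membership changes</h2>
<h3 id="安全性">Safety</h3>
<p>For a configuration change, the first concern is safety, that is, never breaking the cluster's majorities. If only one server is added or removed at a time, then regardless of whether the original cluster size is odd or even, any majority of the old cluster and any majority of the new cluster must overlap, as shown in the figure below. This overlap prevents the cluster from splitting into two independent majorities, because the overlapping server has a vote on both sides. If the new configuration has not been replicated to a majority, its vote keeps the cluster on the old configuration; if the new configuration has been replicated to a majority, its vote switches the cluster to the new configuration. Because this is safe, the cluster can switch directly from one configuration to the other.</p>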

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-25/membership-overlap.png" alt="overlap" /></p>
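<p>The overlap claim can be checked by brute force for small clusters. The sketch below is an illustration, not part of etcd: it represents a configuration as a bitmask of server ids, enumerates every majority of the old and new configurations, and verifies that each pair intersects. The function names <code class="language-plaintext highlighter-rouge">majorities</code> and <code class="language-plaintext highlighter-rouge">alwaysOverlap</code> are made up for this example.</p>

```go
package main

import (
	"fmt"
	"math/bits"
)

// majorities returns every subset of members (as a bitmask) that forms a majority.
func majorities(members uint) []uint {
	need := bits.OnesCount(members)/2 + 1
	var out []uint
	// Enumerate all non-empty sub-masks of members.
	for sub := members; sub > 0; sub = (sub - 1) & members {
		if bits.OnesCount(sub) >= need {
			out = append(out, sub)
		}
	}
	return out
}

// alwaysOverlap reports whether every majority of oldCfg shares at least one
// server with every majority of newCfg.
func alwaysOverlap(oldCfg, newCfg uint) bool {
	for _, a := range majorities(oldCfg) {
		for _, b := range majorities(newCfg) {
			if a&b == 0 {
				return false
			}
		}
	}
	return true
}

func main() {
	for n := 1; n <= 6; n++ {
		oldCfg := uint(1)<<n - 1     // servers 0..n-1
		newCfg := uint(1)<<(n+1) - 1 // one server added
		fmt.Printf("%d -> %d servers: overlap guaranteed = %v\n", n, n+1, alwaysOverlap(oldCfg, newCfg))
	}
	// Adding two servers at once (3 -> 5) allows disjoint majorities,
	// e.g. {0,1} in the old configuration and {2,3,4} in the new one.
	fmt.Println("3 -> 5 servers: overlap guaranteed =", alwaysOverlap(0b00111, 0b11111))
}
```

<p>The last line shows why the single-server restriction matters: once two servers are added in one step, an old majority and a new majority can be disjoint.</p>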

<!--more-->

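<p>The cluster configuration is stored and communicated as a special log entry. Raft specifies that a server always uses the latest configuration in its log, whether or not that entry has been committed. In other words, a new configuration takes effect on a server as soon as it reaches that server's log. Once the log entry for the new configuration is committed, the configuration change is complete, and the leader knows that a majority of servers have adopted the new configuration.</p>
<h3 id="可用性">Availability</h3>
<h4 id="进度追赶">Catching up new servers</h4>
<p>When a server joins the cluster, it typically stores no log entries, and the cluster is most vulnerable to unavailability while this server is syncing its log. For example, suppose a server is added to a three-server cluster and one of the original servers fails at the same time; the cluster becomes temporarily unavailable. To commit a log entry, a majority of the now four-server cluster, that is, three servers (the leader included), must store it. But one original server is down and the new server is still far from being able to store new entries, so there is a window of unavailability.</p>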

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-25/progress.png" alt="progress" /></p>
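<p>The arithmetic behind this example is small enough to write down. The helpers below are hypothetical names used only for illustration; <code class="language-plaintext highlighter-rouge">caughtUp</code> counts the servers (leader included) that are alive and able to store a new entry.</p>

```go
package main

import "fmt"

// quorum returns the number of servers that must store an entry before it commits.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// canCommit reports whether enough servers are alive and caught up to commit new entries.
func canCommit(clusterSize, caughtUp int) bool {
	return caughtUp >= quorum(clusterSize)
}

func main() {
	// Three-server cluster, all healthy: quorum is two.
	fmt.Println(quorum(3), canCommit(3, 3))
	// A fourth, empty server joins and one original server fails:
	// only two servers can store new entries, but the quorum is now three.
	fmt.Println(quorum(4), canCommit(4, 2))
}
```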

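<p>To avoid this window, Raft <strong>introduces an additional phase</strong> before the configuration change, in which the newly added server cannot vote and only receives log replication from the leader. Only after the new server has caught up with the rest of the cluster does the leader decide whether to proceed with the configuration change. The leader is also responsible for aborting the change if the new server is unavailable (for example, because of a wrong address or port) or replicates too slowly (it might never catch up). <strong>In etcd's implementation, a new server in this state is called a learner.</strong></p>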
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// raft/raft.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">raft</span><span class="p">)</span> <span class="n">promotable</span><span class="p">()</span> <span class="kt">bool</span> <span class="p">{</span>
	<span class="n">pr</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">prs</span><span class="o">.</span><span class="n">Progress</span><span class="p">[</span><span class="n">r</span><span class="o">.</span><span class="n">id</span><span class="p">]</span>
	<span class="k">return</span> <span class="n">pr</span> <span class="o">!=</span> <span class="no">nil</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">pr</span><span class="o">.</span><span class="n">IsLearner</span>  <span class="c">// a node in the learner state cannot campaign for leadership</span>
<span class="p">}</span>
</code></pre></div></div>
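<p>Two points matter for how a learner catches up with the cluster: first, the granularity at which log entries are replicated from the leader to the learner; second, how the leader decides that the learner has caught up.</p>

<p>On the first point, a single replication batch must not be too large; otherwise it may congest the leader's heartbeats, cause an election timeout, and trigger a new round of elections.</p>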

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-25/large-snapshot.png" alt="large-snapshot" /></p>

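<p>On the second point, Raft splits replication to the new member into rounds, as shown in the figure below. In each round, all entries present in the leader's log at the start of the round are replicated to the learner; entries the leader receives during the round are left for the next round. As replication progresses, each round gets shorter. After a fixed number of rounds, if the last round took less than an election timeout, the leader adds the learner to the cluster and considers it caught up; otherwise, the leader aborts the configuration change.</p>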

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-25/round.png" alt="round" /></p>
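<p>The round-based rule can be simulated with a few lines of arithmetic. This is a rough model under loud assumptions: replication and append rates are constant, all quantities are floating-point seconds or entries per second, and <code class="language-plaintext highlighter-rouge">lastRoundWithinTimeout</code> is a name invented for this sketch. It is not how etcd tracks learner progress.</p>

```go
package main

import "fmt"

// lastRoundWithinTimeout simulates a fixed number of catch-up rounds (rounds >= 1)
// and reports whether the final round finished in less than one election timeout.
//
//	backlog         entries in the leader's log when the first round starts
//	replRate        entries per second the learner can receive (assumed constant)
//	appendRate      entries per second the leader accepts meanwhile (assumed constant)
//	electionTimeout election timeout in seconds
func lastRoundWithinTimeout(backlog, replRate, appendRate, electionTimeout float64, rounds int) bool {
	pending := backlog
	last := 0.0
	for i := 0; i < rounds; i++ {
		last = pending / replRate   // seconds needed to replicate this round's entries
		pending = last * appendRate // entries that arrived during the round, left for the next one
	}
	return last < electionTimeout
}

func main() {
	// The learner is faster than the write load: rounds shrink quickly.
	fmt.Println(lastRoundWithinTimeout(100000, 10000, 1000, 1, 10))
	// The write load outpaces the learner: rounds grow and the change is aborted.
	fmt.Println(lastRoundWithinTimeout(100000, 10000, 20000, 1, 10))
}
```

<p>In this model each round is shorter than the previous one by the factor <code class="language-plaintext highlighter-rouge">appendRate / replRate</code>, so the learner converges only when it replicates faster than the leader accepts new entries.</p>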

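<h4 id="leader-移除">Removing the leader</h4>
<p>When the server being removed happens to be the leader, the leader can first step down to follower; after that, the removal is handled just like that of an ordinary server.</p>

<p>Raft specifies that <strong>the leader steps down only after the new configuration has been committed</strong>. If it stepped down before that, the old leader could well be elected leader again. Take the two-server cluster in the figure below as an example: when leader S1 receives the new configuration, it should not step down immediately but should first replicate the configuration to follower S2 and only then step down. S2 cannot become leader until it has received the new configuration from S1.</p>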

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-25/removal.png" alt="removal" /></p>

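<p>The short window of unavailability between the leader stepping down and a server holding the new configuration being elected leader <strong>should be tolerable for a cluster</strong>.</p>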
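<p>The step-down rule boils down to a single condition. The sketch below is a hypothetical model (the struct and field names are invented for this example) of when a leader that is being removed may step down.</p>

```go
package main

import "fmt"

// leaderState holds the few fields this sketch needs; the names are hypothetical.
type leaderState struct {
	inNewConfig bool   // is the leader still a member of the new configuration?
	confIndex   uint64 // log index of the new configuration entry
	commitIndex uint64 // highest log index known to be committed
}

// mayStepDown reports whether a leader that is being removed can step down:
// only once the new configuration entry has been committed.
func mayStepDown(s leaderState) bool {
	return !s.inNewConfig && s.commitIndex >= s.confIndex
}

func main() {
	s := leaderState{inNewConfig: false, confIndex: 8, commitIndex: 7}
	fmt.Println(mayStepDown(s)) // entry 8 is not committed yet

	s.commitIndex = 8
	fmt.Println(mayStepDown(s)) // entry 8 is now committed
}
```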
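<h4 id="扰动性选举">Disruptive elections</h4>
<p>Once the leader has created the log entry for the new configuration, servers that <strong>are not part of the new configuration no longer receive the leader's heartbeats</strong>. Because they may never have received the new configuration entry, these servers do not know that they have been removed. They hit an election timeout, start an election, and send <code class="language-plaintext highlighter-rouge">RequestVote RPC</code> requests carrying a new, higher term to the other servers; the current leader reverts to follower when it receives such a request. A new leader is eventually elected from the servers that hold the new configuration, but it may not be the server that was leader before. The process repeats as the removed servers keep timing out and starting elections, which lowers the availability of the whole cluster.</p>
<h5 id="预投票阶段">Pre-vote phase</h5>
<p>Raft considers a <strong>pre-vote phase</strong> to address this problem: a candidate first asks the other servers whether its log is up to date enough to win their votes. Only when the candidate learns that a majority of servers would vote for it does it increment its term and start a normal election.</p>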

<p>Etcd introduced the pre-vote phase in v3.4 as an experimental feature and made it the default in v3.5. When a server starts an election, it first becomes a PreCandidate and issues pre-vote requests:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// raft/raft.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">raft</span><span class="p">)</span> <span class="n">campaign</span><span class="p">(</span><span class="n">t</span> <span class="n">CampaignType</span><span class="p">)</span> <span class="p">{</span>
	<span class="c">// ...</span>
	<span class="k">var</span> <span class="n">term</span> <span class="kt">uint64</span>
	<span class="k">var</span> <span class="n">voteMsg</span> <span class="n">pb</span><span class="o">.</span><span class="n">MessageType</span>
	<span class="k">if</span> <span class="n">t</span> <span class="o">==</span> <span class="n">campaignPreElection</span> <span class="p">{</span>
		<span class="n">r</span><span class="o">.</span><span class="n">becomePreCandidate</span><span class="p">()</span>
		<span class="n">voteMsg</span> <span class="o">=</span> <span class="n">pb</span><span class="o">.</span><span class="n">MsgPreVote</span>
		<span class="n">term</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">Term</span> <span class="o">+</span> <span class="m">1</span>    <span class="c">// the pre-vote is sent with the next term, but r.Term itself is not incremented</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="n">r</span><span class="o">.</span><span class="n">becomeCandidate</span><span class="p">()</span>  <span class="c">// ===&gt; r.reset(r.Term + 1): r.Term is incremented only on becoming a real candidate</span>
		<span class="n">voteMsg</span> <span class="o">=</span> <span class="n">pb</span><span class="o">.</span><span class="n">MsgVote</span>
		<span class="n">term</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">Term</span>
	<span class="p">}</span>

	<span class="c">// in a single-node cluster our own vote already wins, so a candidate can become leader right away</span>
	<span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">res</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">poll</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">id</span><span class="p">,</span> <span class="n">voteRespMsgType</span><span class="p">(</span><span class="n">voteMsg</span><span class="p">),</span> <span class="no">true</span><span class="p">);</span> <span class="n">res</span> <span class="o">==</span> <span class="n">quorum</span><span class="o">.</span><span class="n">VoteWon</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">t</span> <span class="o">==</span> <span class="n">campaignPreElection</span> <span class="p">{</span>
			<span class="n">r</span><span class="o">.</span><span class="n">campaign</span><span class="p">(</span><span class="n">campaignElection</span><span class="p">)</span>
		<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
			<span class="n">r</span><span class="o">.</span><span class="n">becomeLeader</span><span class="p">()</span>
		<span class="p">}</span>
		<span class="k">return</span>
	<span class="p">}</span>
	<span class="c">// ...</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">id</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">ids</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">id</span> <span class="o">==</span> <span class="n">r</span><span class="o">.</span><span class="n">id</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="k">var</span> <span class="n">ctx</span> <span class="p">[]</span><span class="kt">byte</span>
		<span class="k">if</span> <span class="n">t</span> <span class="o">==</span> <span class="n">campaignTransfer</span> <span class="p">{</span>  <span class="c">// record that the vote was triggered by a leader transfer</span>
			<span class="n">ctx</span> <span class="o">=</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>
		<span class="p">}</span>
		<span class="c">// send a vote request to every server except ourselves</span>
		<span class="n">r</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="n">pb</span><span class="o">.</span><span class="n">Message</span><span class="p">{</span><span class="n">Term</span><span class="o">:</span> <span class="n">term</span><span class="p">,</span> <span class="n">To</span><span class="o">:</span> <span class="n">id</span><span class="p">,</span> <span class="n">Type</span><span class="o">:</span> <span class="n">voteMsg</span><span class="p">,</span> <span class="n">Index</span><span class="o">:</span> <span class="n">r</span><span class="o">.</span><span class="n">raftLog</span><span class="o">.</span><span class="n">lastIndex</span><span class="p">(),</span> <span class="n">LogTerm</span><span class="o">:</span> <span class="n">r</span><span class="o">.</span><span class="n">raftLog</span><span class="o">.</span><span class="n">lastTerm</span><span class="p">(),</span> <span class="n">Context</span><span class="o">:</span> <span class="n">ctx</span><span class="p">})</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
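<p>On receiving a pre-vote request, every server runs several checks before voting; the most important one ensures that the candidate's log is sufficiently up to date:</p>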
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// raft/raft.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">raft</span><span class="p">)</span> <span class="n">Step</span><span class="p">(</span><span class="n">m</span> <span class="n">pb</span><span class="o">.</span><span class="n">Message</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="c">// ...</span>

	<span class="k">switch</span> <span class="n">m</span><span class="o">.</span><span class="n">Type</span> <span class="p">{</span>
	<span class="c">// ... for vote and pre-vote messages</span>
	<span class="k">case</span> <span class="n">pb</span><span class="o">.</span><span class="n">MsgVote</span><span class="p">,</span> <span class="n">pb</span><span class="o">.</span><span class="n">MsgPreVote</span><span class="o">:</span>
		<span class="c">// when may this server grant its vote?</span>
		<span class="n">canVote</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">Vote</span> <span class="o">==</span> <span class="n">m</span><span class="o">.</span><span class="n">From</span> <span class="o">||</span>  <span class="c">// repeated request from the candidate we already voted for</span>
			<span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">Vote</span> <span class="o">==</span> <span class="n">None</span> <span class="o">&amp;&amp;</span> <span class="n">r</span><span class="o">.</span><span class="n">lead</span> <span class="o">==</span> <span class="n">None</span><span class="p">)</span> <span class="o">||</span>  <span class="c">// have not voted yet, and no leader is known in the current term</span>
			<span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">Type</span> <span class="o">==</span> <span class="n">pb</span><span class="o">.</span><span class="n">MsgPreVote</span> <span class="o">&amp;&amp;</span> <span class="n">m</span><span class="o">.</span><span class="n">Term</span> <span class="o">&gt;</span> <span class="n">r</span><span class="o">.</span><span class="n">Term</span><span class="p">)</span>  <span class="c">// a pre-vote request for a term higher than ours</span>
		<span class="c">// for either request type, the candidate's log must be sufficiently up to date</span>
		<span class="k">if</span> <span class="n">canVote</span> <span class="o">&amp;&amp;</span> <span class="n">r</span><span class="o">.</span><span class="n">raftLog</span><span class="o">.</span><span class="n">isUpToDate</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">Index</span><span class="p">,</span> <span class="n">m</span><span class="o">.</span><span class="n">LogTerm</span><span class="p">)</span> <span class="p">{</span>
			<span class="n">r</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="n">pb</span><span class="o">.</span><span class="n">Message</span><span class="p">{</span><span class="n">To</span><span class="o">:</span> <span class="n">m</span><span class="o">.</span><span class="n">From</span><span class="p">,</span> <span class="n">Term</span><span class="o">:</span> <span class="n">m</span><span class="o">.</span><span class="n">Term</span><span class="p">,</span> <span class="n">Type</span><span class="o">:</span> <span class="n">voteRespMsgType</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">Type</span><span class="p">)})</span>  <span class="c">// reply with the term from the request</span>
			<span class="k">if</span> <span class="n">m</span><span class="o">.</span><span class="n">Type</span> <span class="o">==</span> <span class="n">pb</span><span class="o">.</span><span class="n">MsgVote</span> <span class="p">{</span>
				<span class="c">// reset the election timer and record whom we voted for</span>
				<span class="n">r</span><span class="o">.</span><span class="n">electionElapsed</span> <span class="o">=</span> <span class="m">0</span>
				<span class="n">r</span><span class="o">.</span><span class="n">Vote</span> <span class="o">=</span> <span class="n">m</span><span class="o">.</span><span class="n">From</span>
			<span class="p">}</span>
		<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
			<span class="c">// otherwise reply to the vote request with a rejection</span>
			<span class="n">r</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="n">pb</span><span class="o">.</span><span class="n">Message</span><span class="p">{</span><span class="n">To</span><span class="o">:</span> <span class="n">m</span><span class="o">.</span><span class="n">From</span><span class="p">,</span> <span class="n">Term</span><span class="o">:</span> <span class="n">r</span><span class="o">.</span><span class="n">Term</span><span class="p">,</span> <span class="n">Type</span><span class="o">:</span> <span class="n">voteRespMsgType</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">Type</span><span class="p">),</span> <span class="n">Reject</span><span class="o">:</span> <span class="no">true</span><span class="p">})</span>  <span class="c">// our own term, unchanged</span>
		<span class="p">}</span>

		<span class="c">// ...</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
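<p>The up-to-date comparison behind <code class="language-plaintext highlighter-rouge">isUpToDate</code> follows Raft's rule: the log whose last entry has the higher term wins, and with equal terms the longer log wins. The standalone sketch below models that rule; <code class="language-plaintext highlighter-rouge">logTail</code> is a hypothetical stand-in for the real log type.</p>

```go
package main

import "fmt"

// logTail describes the last entry of a voter's log (hypothetical type).
type logTail struct {
	lastIndex uint64
	lastTerm  uint64
}

// isUpToDate reports whether a candidate whose last entry is (index, term)
// has a log at least as up to date as the voter's log l.
// A higher last term wins; with equal terms, the longer log wins.
func (l logTail) isUpToDate(index, term uint64) bool {
	return term > l.lastTerm || (term == l.lastTerm && index >= l.lastIndex)
}

func main() {
	voter := logTail{lastIndex: 10, lastTerm: 3}
	fmt.Println(voter.isUpToDate(5, 4))  // higher term, shorter log
	fmt.Println(voter.isUpToDate(10, 3)) // identical tail
	fmt.Println(voter.isUpToDate(9, 3))  // same term, shorter log
}
```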
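<p><strong>Pre-vote does not fully solve the problem, though</strong>. As the figure below shows, if S1 to S3 stop receiving heartbeats before leader S4 has replicated and committed the new configuration entry, S1 may time out and send S4 a vote request carrying a higher term, forcing S4 to revert to follower. Pre-vote does not stop S1 here, because its log is as up to date as that of a majority of the cluster.</p>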

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-25/pre-vote.png" alt="pre-vote" /></p>
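<h5 id="选举条件">Election conditions</h5>
<p>For this problem, Raft's recommendation is: as long as a leader is able to get heartbeats to its cluster, neither the leader nor its followers accept vote requests with a higher term. Concretely, a server that has heard from a current leader within the minimum election timeout neither updates its term nor grants its vote. This avoids disruptive elections triggered by servers from the old configuration without affecting normal elections.</p>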

<p>Etcd's implementation does the same: for every vote or pre-vote request, a server first checks whether it is within a leader's term and still receiving that leader's heartbeats:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// raft/raft.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">raft</span><span class="p">)</span> <span class="n">Step</span><span class="p">(</span><span class="n">m</span> <span class="n">pb</span><span class="o">.</span><span class="n">Message</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="k">switch</span> <span class="p">{</span>
	<span class="k">case</span> <span class="n">m</span><span class="o">.</span><span class="n">Term</span> <span class="o">==</span> <span class="m">0</span><span class="o">:</span>
		<span class="c">// local message</span>
	<span class="k">case</span> <span class="n">m</span><span class="o">.</span><span class="n">Term</span> <span class="o">&gt;</span> <span class="n">r</span><span class="o">.</span><span class="n">Term</span><span class="o">:</span>
		<span class="k">if</span> <span class="n">m</span><span class="o">.</span><span class="n">Type</span> <span class="o">==</span> <span class="n">pb</span><span class="o">.</span><span class="n">MsgVote</span> <span class="o">||</span> <span class="n">m</span><span class="o">.</span><span class="n">Type</span> <span class="o">==</span> <span class="n">pb</span><span class="o">.</span><span class="n">MsgPreVote</span> <span class="p">{</span>
			<span class="n">force</span> <span class="o">:=</span> <span class="n">bytes</span><span class="o">.</span><span class="n">Equal</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">(</span><span class="n">campaignTransfer</span><span class="p">))</span>  <span class="c">// was the vote triggered by a leader transfer?</span>
			<span class="n">inLease</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">checkQuorum</span> <span class="o">&amp;&amp;</span> <span class="n">r</span><span class="o">.</span><span class="n">lead</span> <span class="o">!=</span> <span class="n">None</span> <span class="o">&amp;&amp;</span> <span class="n">r</span><span class="o">.</span><span class="n">electionElapsed</span> <span class="o">&lt;</span> <span class="n">r</span><span class="o">.</span><span class="n">electionTimeout</span>  <span class="c">// no election timeout yet, so heartbeats are still being sent/received normally</span>
			<span class="k">if</span> <span class="o">!</span><span class="n">force</span> <span class="o">&amp;&amp;</span> <span class="n">inLease</span> <span class="p">{</span>
				<span class="c">// not a leader transfer, and still receiving heartbeats within a valid term: return without voting</span>
				<span class="k">return</span> <span class="no">nil</span>
			<span class="p">}</span>
		<span class="p">}</span>
		<span class="c">// ...</span>
	<span class="p">}</span>
	<span class="c">// ...</span>
<span class="p">}</span>
</code></pre></div></div>
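<p>The lease check can be isolated into a small function to make its effect easier to see. This is a simplified model, not Etcd's code: the <code class="language-plaintext highlighter-rouge">follower</code> struct and the <code class="language-plaintext highlighter-rouge">ignoresVote</code> method are hypothetical, and only the condition itself mirrors the excerpt above.</p>

```go
package main

import "fmt"

// none marks the absence of a known leader.
const none uint64 = 0

// follower carries the fields the lease check looks at (hypothetical type).
type follower struct {
	checkQuorum     bool
	lead            uint64 // none when no leader is known
	electionElapsed int    // ticks since the leader was last heard from
	electionTimeout int    // ticks before an election may start
}

// ignoresVote reports whether a higher-term vote request is dropped because
// the server still considers its leader alive. A leader transfer bypasses
// the lease.
func (f follower) ignoresVote(leaderTransfer bool) bool {
	inLease := f.checkQuorum && f.lead != none && f.electionElapsed < f.electionTimeout
	return !leaderTransfer && inLease
}

func main() {
	f := follower{checkQuorum: true, lead: 4, electionElapsed: 3, electionTimeout: 10}
	fmt.Println(f.ignoresVote(false)) // heartbeats are recent: request dropped
	fmt.Println(f.ignoresVote(true))  // leader transfer: request processed

	f.electionElapsed = 10
	fmt.Println(f.ignoresVote(false)) // lease expired: request processed
}
```

<p>A removed server that keeps timing out therefore cannot depose a healthy leader, while a deliberate leader transfer still goes through.</p>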
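<h2 id="多成员变更">Multi-member changes</h2>
<p>A multi-member change can be carried out as a sequence of single-member changes, but in practice that approach may not be convenient.</p>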

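<p>Unlike a single-member change, a multi-member change means the cluster <strong>cannot switch directly from the old configuration to the new one</strong>, because the guarantee that majorities of the old and new configurations overlap no longer holds. At some point during the switch, the cluster could therefore be split by the old and new configurations into two disjoint majorities. In the figure below, for example, Server 1 and 2 form one majority while Server 3 to 5 form another.</p>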

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-25/two-major.png" alt="two-major" /></p>
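<p>Whether two disjoint majorities can exist is a matter of counting: they can if the two quorum sizes together fit into the union of both configurations. The sketch below checks this; <code class="language-plaintext highlighter-rouge">quorum</code> and <code class="language-plaintext highlighter-rouge">canSplit</code> are hypothetical helper names.</p>

```go
package main

import "fmt"

// quorum returns the majority size of a configuration with n voters.
func quorum(n int) int { return n/2 + 1 }

// canSplit reports whether a majority of the old configuration and a majority
// of the new configuration can be disjoint. oldN and newN are the sizes of the
// two configurations and shared is the number of servers present in both.
func canSplit(oldN, newN, shared int) bool {
	union := oldN + newN - shared
	return quorum(oldN)+quorum(newN) <= union
}

func main() {
	fmt.Println(canSplit(3, 4, 3)) // add one server to a 3-node cluster
	fmt.Println(canSplit(4, 3, 3)) // remove one server from a 4-node cluster
	fmt.Println(canSplit(3, 5, 3)) // add two servers at once
}
```

<p>Adding or removing a single server never leaves room for two disjoint majorities, while going from 3 to 5 servers in one step does: 2 old servers and 3 other servers can each form a majority.</p>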
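<h3 id="联合一致性">Joint consensus</h3>
<p>To keep arbitrary membership/configuration changes safe, Raft first switches the cluster to a transitional configuration called joint consensus. Only once the joint consensus configuration has been committed does the cluster transition to the new configuration. Joint consensus combines the old and the new configuration:</p>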

<ul>
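  <li>Log entries are replicated to the servers of both configurations</li>
  <li>Any server that holds the joint configuration may be elected leader</li>
  <li>Elections and entry commitment require separate majorities from both the old and the new configuration. For example, when a cluster grows from 3 to 9 servers, agreement requires 2 of the 3 servers in the old configuration and 5 of the 9 servers in the new configuration</li>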
</ul>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-25/joint-consensus.png" alt="joint-consensus" /></p>
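<p>The double-majority rule can be sketched in a few lines. The helper names <code class="language-plaintext highlighter-rouge">majority</code> and <code class="language-plaintext highlighter-rouge">jointWon</code> and the server IDs are hypothetical; this only illustrates the counting rule and is not Etcd's quorum code.</p>

```go
package main

import "fmt"

// majority reports whether the granted votes from members of cfg form a
// majority of cfg.
func majority(cfg []string, votes map[string]bool) bool {
	granted := 0
	for _, id := range cfg {
		if votes[id] {
			granted++
		}
	}
	return granted > len(cfg)/2
}

// jointWon reports whether a decision passes under joint consensus: it needs
// a majority of the old configuration and a majority of the new one.
func jointWon(oldCfg, newCfg []string, votes map[string]bool) bool {
	return majority(oldCfg, votes) && majority(newCfg, votes)
}

func main() {
	oldCfg := []string{"s1", "s2", "s3"}
	newCfg := []string{"s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9"}

	// 2 of 3 old servers and 5 of 9 new servers agree.
	votes := map[string]bool{"s1": true, "s2": true, "s4": true, "s5": true, "s6": true}
	fmt.Println(jointWon(oldCfg, newCfg, votes))

	// 6 of 9 new servers agree, but none of the old ones do.
	votes = map[string]bool{"s4": true, "s5": true, "s6": true, "s7": true, "s8": true, "s9": true}
	fmt.Println(jointWon(oldCfg, newCfg, votes))
}
```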

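<p>When a leader receives a request to change from the old configuration to the new one, it stores the joint configuration as a log entry and replicates it to the followers. As with single-member changes, a follower starts using the joint configuration as soon as it receives it. If the leader crashes at this point, the newly elected leader <strong>can only be using either the old configuration or the joint configuration</strong> (depending on whether it received the joint configuration). Once the joint configuration is committed, the leader creates a log entry for the new configuration and replicates it to the cluster; servers likewise start using the new configuration as soon as they receive it. After the new configuration is committed, servers that are not part of it are shut down.</p>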

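<p><strong>This kind of configuration change is a two-phase approach</strong>. As the figure above shows, there is <strong>no</strong> point in time at which the old and the new configuration can each make decisions on their own.</p>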
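<p>The two phases can be written down as a tiny state machine. The sketch is a hypothetical model (the <code class="language-plaintext highlighter-rouge">phase</code> type, <code class="language-plaintext highlighter-rouge">advance</code> and <code class="language-plaintext highlighter-rouge">deciders</code> are invented for this example) that tracks which majorities a decision needs in each phase.</p>

```go
package main

import "fmt"

// phase is the configuration the cluster is currently operating under.
type phase int

const (
	phaseOld   phase = iota // C_old
	phaseJoint              // C_old,new
	phaseNew                // C_new
)

// advance returns the next phase. A change request moves C_old to the joint
// configuration; the joint configuration is left only after its log entry
// has been committed; C_new is final.
func advance(p phase, entryCommitted bool) phase {
	switch {
	case p == phaseOld:
		return phaseJoint
	case p == phaseJoint && entryCommitted:
		return phaseNew
	default:
		return p
	}
}

// deciders reports which majorities a decision needs in phase p.
func deciders(p phase) (needOld, needNew bool) {
	return p != phaseNew, p != phaseOld
}

func main() {
	p := phaseOld
	for _, committed := range []bool{false, false, true} {
		p = advance(p, committed)
		needOld, needNew := deciders(p)
		fmt.Printf("phase=%d needOld=%v needNew=%v\n", p, needOld, needNew)
	}
}
```

<p>Because the joint phase always sits between the other two and requires both majorities, the old configuration alone decides only before it and the new configuration alone only after it.</p>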

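<p>Etcd does <strong>not actually carry out</strong> multi-member configuration changes this way; it still changes one member at a time. Unlike in Raft, a membership change in Etcd takes effect <strong>not when</strong> its entry is added to the log, but after that entry has been committed.</p>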

<h2 id="reference">Reference</h2>

<ol>
  <li><a href="https://github.com/ongardie/dissertation/blob/master/stanford.pdf">https://github.com/ongardie/dissertation/blob/master/stanford.pdf</a></li>
  <li><a href="https://github.com/etcd-io/etcd/blob/release-3.4/Documentation/learning/design-learner.md">https://github.com/etcd-io/etcd/blob/release-3.4/Documentation/learning/design-learner.md</a></li>
  <li><a href="https://github.com/etcd-io/etcd/blob/release-3.4/raft/README.md">https://github.com/etcd-io/etcd/blob/release-3.4/raft/README.md</a></li>
  <li><a href="https://github.com/etcd-io/etcd/blob/release-3.4/raft/design.md">https://github.com/etcd-io/etcd/blob/release-3.4/raft/design.md</a></li>
  <li><a href="https://kubernetes.io/blog/2019/08/30/announcing-etcd-3-4/">https://kubernetes.io/blog/2019/08/30/announcing-etcd-3-4/</a></li>
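</ol>]]></content><author><name>Your Name</name><email>shawnhxh@outlook.com</email></author><category term="post" /><category term="Distributed System" /><summary type="html"><![CDATA[This post analyzes membership changes in the Raft protocol alongside the Etcd v3.4 implementation. A change in cluster membership is a change in cluster configuration. Raft allows a cluster's configuration to be changed automatically without restarting the cluster. Single-member changes Safety For a cluster configuration change, the first concern is safety, that is, not breaking the cluster's majorities. If only one server is added or removed at a time, then regardless of whether the original cluster size is odd or even, a majority of the old cluster and a majority of the new cluster must overlap, as shown below. This overlap prevents the cluster from being split into two separate majorities, because the overlapping server votes in both majorities: if the new configuration has not been replicated to a majority, its vote keeps the cluster on the old configuration; if the new configuration has been replicated to a majority, its vote switches the cluster to the new configuration. The switch can be made directly because it is safe.]]></summary></entry><entry><title type="html">How MetalLB Works</title><link href="https://shawnh2.github.io/post/2023/06/06/metallb-walk-through.html" rel="alternate" type="text/html" title="How MetalLB Works" /><published>2023-06-06T00:00:00+08:00</published><updated>2023-06-06T00:00:00+08:00</updated><id>https://shawnh2.github.io/post/2023/06/06/metallb-walk-through</id><content type="html" xml:base="https://shawnh2.github.io/post/2023/06/06/metallb-walk-through.html"><![CDATA[<blockquote>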
  <p>The code in this post is based on <a href="https://github.com/metallb/metallb/tree/v0.13.9">MetalLB v0.13.9</a>.</p>
</blockquote>

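<p>MetalLB is a load balancer for bare-metal k8s clusters, built on standard routing protocols. Bare metal here means a k8s cluster deployed directly on your own machines: such a cluster cannot use Services of type LoadBalancer, because it comes with no load balancer implementation; one is only available on cloud IaaS platforms (AWS, GCP, and so on).</p>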

<p>MetalLB implements such a load balancer in two parts: <strong>Address Allocation</strong> and <strong>External Announcement</strong>.</p>

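<h2 id="地址分配">Address allocation</h2>
<p>Cloud providers assign an IP address to every load balancer that is requested. MetalLB takes on the same job in a bare-metal cluster: it assigns IP addresses from a pre-configured address pool (AddressPool), and reclaims the address when the Service is deleted.</p>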
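<p>Allocation and reclamation from a pool can be sketched with the standard <code class="language-plaintext highlighter-rouge">net/netip</code> package. The <code class="language-plaintext highlighter-rouge">pool</code> type below is a hypothetical, heavily simplified stand-in; MetalLB's real allocator handles much more (multiple pools, shared IPs, address families).</p>

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// pool is a hypothetical, much simplified address pool.
type pool struct {
	prefix netip.Prefix
	inUse  map[netip.Addr]string // address -> Service key
}

// newPool builds a pool from a CIDR such as "192.168.10.0/30".
func newPool(cidr string) (*pool, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	return &pool{prefix: p.Masked(), inUse: map[netip.Addr]string{}}, nil
}

// allocate hands the first free address in the pool to svc.
func (p *pool) allocate(svc string) (netip.Addr, error) {
	for a := p.prefix.Addr(); p.prefix.Contains(a); a = a.Next() {
		if _, taken := p.inUse[a]; !taken {
			p.inUse[a] = svc
			return a, nil
		}
	}
	return netip.Addr{}, errors.New("address pool exhausted")
}

// release returns every address held by svc to the pool.
func (p *pool) release(svc string) {
	for a, owner := range p.inUse {
		if owner == svc {
			delete(p.inUse, a)
		}
	}
}

func main() {
	p, err := newPool("192.168.10.0/30")
	if err != nil {
		panic(err)
	}
	a, _ := p.allocate("default/web")
	b, _ := p.allocate("default/api")
	fmt.Println(a, b)

	// Deleting a Service frees its address for the next request.
	p.release("default/web")
	c, _ := p.allocate("default/db")
	fmt.Println(c)
}
```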

<h3 id="核心方法">Core methods</h3>
<h4 id="reconcileservice">reconcileService</h4>
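<p>This method is the reconcile method of the service-controller, which lives in MetalLB's controller component. It watches Services of <strong>all types</strong> and manages their IP addresses (allocation or reclamation).</p>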
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/k8s/controllers/service_controller.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">ServiceReconciler</span><span class="p">)</span> <span class="n">reconcileService</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">req</span> <span class="n">ctrl</span><span class="o">.</span><span class="n">Request</span><span class="p">)</span> <span class="p">(</span><span class="n">ctrl</span><span class="o">.</span><span class="n">Result</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="c">// ...</span>
	<span class="k">var</span> <span class="n">service</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Service</span>

	<span class="c">// look up the Service object for the NamespacedName carried by the request</span>
	<span class="c">// (serviceFor calls r.Get(ctx, name, &amp;res) internally)</span>
	<span class="n">service</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">serviceFor</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="o">.</span><span class="n">NamespacedName</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">ctrl</span><span class="o">.</span><span class="n">Result</span><span class="p">{},</span> <span class="n">err</span>
	<span class="p">}</span>

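	<span class="c">// if MetalLB's configuration sets a LoadBalancerClass, compare it with the Service's</span>
	<span class="c">// the Service passes only if they match or none is configured; by default the configuration leaves this field unset</span>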
	<span class="k">if</span> <span class="n">filterByLoadBalancerClass</span><span class="p">(</span><span class="n">service</span><span class="p">,</span> <span class="n">r</span><span class="o">.</span><span class="n">LoadBalancerClass</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">ctrl</span><span class="o">.</span><span class="n">Result</span><span class="p">{},</span> <span class="no">nil</span>
	<span class="p">}</span>

	<span class="c">// fetch the Endpoints or EndpointSlices backing the Service</span>
	<span class="n">epSlices</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">epsOrSlicesForServices</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">req</span><span class="o">.</span><span class="n">NamespacedName</span><span class="p">,</span> <span class="n">r</span><span class="o">.</span><span class="n">Endpoints</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">ctrl</span><span class="o">.</span><span class="n">Result</span><span class="p">{},</span> <span class="n">err</span>
	<span class="p">}</span>
	<span class="c">// at this point, whether service is nil tells us if this reconcile is a deletion or an update</span>

	<span class="c">// process the Service, including IP allocation and reclamation</span>
	<span class="n">res</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">Handler</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">Logger</span><span class="p">,</span> <span class="n">req</span><span class="o">.</span><span class="n">NamespacedName</span><span class="o">.</span><span class="n">String</span><span class="p">(),</span> <span class="n">service</span><span class="p">,</span> <span class="n">epSlices</span><span class="p">)</span>
	<span class="k">switch</span> <span class="n">res</span> <span class="p">{</span>
	<span class="k">case</span> <span class="n">SyncStateError</span><span class="o">:</span>
		<span class="k">return</span> <span class="n">ctrl</span><span class="o">.</span><span class="n">Result</span><span class="p">{},</span> <span class="n">retryError</span>
	<span class="k">case</span> <span class="n">SyncStateReprocessAll</span><span class="o">:</span>
		<span class="c">// trigger a full reconcile of all Services</span>
		<span class="n">r</span><span class="o">.</span><span class="n">forceReload</span><span class="p">()</span>
		<span class="k">return</span> <span class="n">ctrl</span><span class="o">.</span><span class="n">Result</span><span class="p">{},</span> <span class="no">nil</span>
	<span class="k">case</span> <span class="n">SyncStateErrorNoRetry</span><span class="o">:</span>
		<span class="k">return</span> <span class="n">ctrl</span><span class="o">.</span><span class="n">Result</span><span class="p">{},</span> <span class="no">nil</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">ctrl</span><span class="o">.</span><span class="n">Result</span><span class="p">{},</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>

<!--more-->

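<p>The Service Controller reconciles on update requests of type <code class="language-plaintext highlighter-rouge">ctrl.Request</code>. Services themselves are watched through <code class="language-plaintext highlighter-rouge">For</code>. In addition, the first <code class="language-plaintext highlighter-rouge">Watches</code> call on the manager in MetalLB's controller component watches all Endpoints resources and turns each of them into a <code class="language-plaintext highlighter-rouge">ctrl.Request</code> whose <code class="language-plaintext highlighter-rouge">NamespacedName</code> is built from the Endpoints' namespace and name, which are the same as those of the Service it backs.</p>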
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ctrl</span><span class="o">.</span><span class="n">NewControllerManagedBy</span><span class="p">(</span><span class="n">mgr</span><span class="p">)</span><span class="o">.</span>
    <span class="n">For</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v1</span><span class="o">.</span><span class="n">Service</span><span class="p">{})</span><span class="o">.</span>
    <span class="n">Watches</span><span class="p">(</span><span class="o">&amp;</span><span class="n">source</span><span class="o">.</span><span class="n">Kind</span><span class="p">{</span><span class="n">Type</span><span class="o">:</span> <span class="o">&amp;</span><span class="n">v1</span><span class="o">.</span><span class="n">Endpoints</span><span class="p">{}},</span>
        <span class="n">handler</span><span class="o">.</span><span class="n">EnqueueRequestsFromMapFunc</span><span class="p">(</span><span class="k">func</span><span class="p">(</span><span class="n">obj</span> <span class="n">client</span><span class="o">.</span><span class="n">Object</span><span class="p">)</span> <span class="p">[]</span><span class="n">reconcile</span><span class="o">.</span><span class="n">Request</span> <span class="p">{</span>
            <span class="n">endpoints</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">obj</span><span class="o">.</span><span class="p">(</span><span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Endpoints</span><span class="p">)</span>
            <span class="k">if</span> <span class="o">!</span><span class="n">ok</span> <span class="p">{</span>
                <span class="k">return</span> <span class="p">[]</span><span class="n">reconcile</span><span class="o">.</span><span class="n">Request</span><span class="p">{}</span>
            <span class="p">}</span>
            <span class="n">name</span> <span class="o">:=</span> <span class="n">types</span><span class="o">.</span><span class="n">NamespacedName</span><span class="p">{</span><span class="n">Name</span><span class="o">:</span> <span class="n">endpoints</span><span class="o">.</span><span class="n">Name</span><span class="p">,</span> <span class="n">Namespace</span><span class="o">:</span> <span class="n">endpoints</span><span class="o">.</span><span class="n">Namespace</span><span class="p">}</span>
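            <span class="k">return</span> <span class="p">[]</span><span class="n">reconcile</span><span class="o">.</span><span class="n">Request</span><span class="p">{{</span><span class="n">NamespacedName</span><span class="o">:</span> <span class="n">name</span><span class="p">}}</span>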
        <span class="p">}))</span><span class="o">.</span>
    <span class="n">Watches</span><span class="p">(</span><span class="o">&amp;</span><span class="n">source</span><span class="o">.</span><span class="n">Channel</span><span class="p">{</span><span class="n">Source</span><span class="o">:</span> <span class="n">r</span><span class="o">.</span><span class="n">Reload</span><span class="p">},</span> <span class="o">&amp;</span><span class="n">handler</span><span class="o">.</span><span class="n">EnqueueRequestForObject</span><span class="p">{})</span><span class="o">.</span>
    <span class="n">Complete</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
</code></pre></div></div>
<p>Besides the resource watch set up by the first <code class="language-plaintext highlighter-rouge">Watches</code> call, the Service Controller registers a second <code class="language-plaintext highlighter-rouge">Watches</code> call that listens for Reload events. A Reload event triggers a full reconcile of all Services (the same thing <code class="language-plaintext highlighter-rouge">r.forceReload()</code> above does); the <code class="language-plaintext highlighter-rouge">Reload</code> channel exists so that other parts of the code can trigger a full reconcile. Converting the resources seen by the first <code class="language-plaintext highlighter-rouge">Watches</code> call into update requests serves the same goal: <strong>it keeps the reconcile logic simple</strong>.</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-06/metallb-reconcile.png" alt="metallb-reconcile" /></p>

<p>In the Service Controller's reconcile loop, the kind of update request decides which kind of reconcile runs. Whether to run a full reconcile is determined by a special <code class="language-plaintext highlighter-rouge">NamespacedName</code> value in the <code class="language-plaintext highlighter-rouge">ctrl.Request</code>:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">ServiceReconciler</span><span class="p">)</span> <span class="n">Reconcile</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">req</span> <span class="n">ctrl</span><span class="o">.</span><span class="n">Request</span><span class="p">)</span> <span class="p">(</span><span class="n">ctrl</span><span class="o">.</span><span class="n">Result</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">if</span> <span class="o">!</span><span class="n">isReloadReq</span><span class="p">(</span><span class="n">req</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">r</span><span class="o">.</span><span class="n">reconcileService</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">r</span><span class="o">.</span><span class="n">reprocessAll</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span>
<span class="p">}</span>

<span class="c">// internal/k8s/controllers/service_controller_reload.go</span>
<span class="k">func</span> <span class="n">isReloadReq</span><span class="p">(</span><span class="n">req</span> <span class="n">ctrl</span><span class="o">.</span><span class="n">Request</span><span class="p">)</span> <span class="kt">bool</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">req</span><span class="o">.</span><span class="n">Name</span> <span class="o">==</span> <span class="s">"reload"</span> <span class="o">&amp;&amp;</span> <span class="n">req</span><span class="o">.</span><span class="n">Namespace</span> <span class="o">==</span> <span class="s">"metallbreload"</span> <span class="p">{</span>
		<span class="k">return</span> <span class="no">true</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="no">false</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The full reconciliation <code class="language-plaintext highlighter-rouge">reprocessAll</code> is essentially <code class="language-plaintext highlighter-rouge">reconcileService</code> with the resource <code class="language-plaintext highlighter-rouge">Get</code> call replaced by a <code class="language-plaintext highlighter-rouge">List</code> call; the handling of each individual Service is unchanged.</p>
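<p>The relationship between the single and the full reconcile can be sketched as follows. This is a minimal illustration rather than MetalLB's code: <code class="language-plaintext highlighter-rouge">Store</code>, <code class="language-plaintext highlighter-rouge">Reconcile</code> and the string keys are hypothetical stand-ins for the API server client and <code class="language-plaintext highlighter-rouge">ctrl.Request</code>; only the sentinel namespace and name come from the code above.</p>

```go
package main

import "sort"

// reloadKey mirrors the sentinel namespace/name pair checked by isReloadReq.
const reloadKey = "metallbreload/reload"

// Store is a hypothetical in-memory stand-in for the API server,
// mapping "namespace/name" to a Service's type.
type Store map[string]string

// Reconcile dispatches like the controller above and returns the keys
// that were passed to the per-Service handler.
func Reconcile(store Store, key string, handle func(key, svcType string)) []string {
	if key != reloadKey {
		// Single reconcile: one Get, one handler call.
		handle(key, store[key])
		return []string{key}
	}
	// Full reconcile: List every Service and reuse the same handler.
	keys := make([]string, 0, len(store))
	for k := range store {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order, for the sketch only
	for _, k := range keys {
		handle(k, store[k])
	}
	return keys
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">Reconcile</code> with an ordinary key invokes the handler once, while calling it with <code class="language-plaintext highlighter-rouge">reloadKey</code> invokes the same handler once per stored Service.</p>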
<h4 id="setbalancer">SetBalancer</h4>
<p>This is the method invoked as <code class="language-plaintext highlighter-rouge">r.Handler</code> inside the Service Controller's <code class="language-plaintext highlighter-rouge">reconcileService</code>; it handles updates to Service resources. Its rough flow is:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// controller/main.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">controller</span><span class="p">)</span> <span class="n">SetBalancer</span><span class="p">(</span><span class="n">l</span> <span class="n">log</span><span class="o">.</span><span class="n">Logger</span><span class="p">,</span> <span class="n">name</span> <span class="kt">string</span><span class="p">,</span> <span class="n">svcRo</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Service</span><span class="p">,</span> <span class="n">_</span> <span class="n">epslices</span><span class="o">.</span><span class="n">EpsOrSlices</span><span class="p">)</span> <span class="n">controllers</span><span class="o">.</span><span class="n">SyncState</span> <span class="p">{</span>
	<span class="c">// A nil Service triggers IP reclamation</span>
	<span class="k">if</span> <span class="n">svcRo</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>  <span class="c">// Read only</span>
		<span class="n">c</span><span class="o">.</span><span class="n">deleteBalancer</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">name</span><span class="p">)</span> <span class="o">---&gt;---</span>
                                                  <span class="err">\</span>
                                                   <span class="err">\</span>
                                                    <span class="n">c</span><span class="o">.</span><span class="n">ips</span><span class="o">.</span><span class="n">Unassign</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>

		<span class="c">// Afterwards, reprocess everything, since other LoadBalancer Services may be waiting for an IP address</span>
		<span class="k">return</span> <span class="n">controllers</span><span class="o">.</span><span class="n">SyncStateReprocessAll</span>
	<span class="p">}</span>

	<span class="c">// Before allocating an IP address, make sure the address pools have been configured</span>
	<span class="k">if</span> <span class="n">c</span><span class="o">.</span><span class="n">pools</span> <span class="o">==</span> <span class="no">nil</span> <span class="o">||</span> <span class="n">c</span><span class="o">.</span><span class="n">pools</span><span class="o">.</span><span class="n">ByName</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">controllers</span><span class="o">.</span><span class="n">SyncStateSuccess</span>
	<span class="p">}</span>

	<span class="n">svc</span> <span class="o">:=</span> <span class="n">svcRo</span><span class="o">.</span><span class="n">DeepCopy</span><span class="p">()</span>
	<span class="n">successRes</span> <span class="o">:=</span> <span class="n">controllers</span><span class="o">.</span><span class="n">SyncStateSuccess</span>
	<span class="c">// Check whether this Service already had an IP address allocated</span>
	<span class="n">wasAllocated</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">isServiceAllocated</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="o">---&gt;---</span>
        <span class="c">// Obtain and assign IPs                            \</span>
	<span class="n">c</span><span class="o">.</span><span class="n">convergeBalancer</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">svc</span><span class="p">)</span>                    <span class="err">\</span>
                                                    	 <span class="n">c</span><span class="o">.</span><span class="n">ips</span><span class="o">.</span><span class="n">Pool</span><span class="p">(</span><span class="n">key</span><span class="p">)</span> <span class="o">!=</span> <span class="s">""</span>

	<span class="c">// convergeBalancer may unassign the Service's IP; if that happened,</span>
	<span class="k">if</span> <span class="n">wasAllocated</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">c</span><span class="o">.</span><span class="n">isServiceAllocated</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="p">{</span>
		<span class="c">// the released IP may be usable by another LoadBalancer Service, so reprocess everything</span>
		<span class="n">successRes</span> <span class="o">=</span> <span class="n">controllers</span><span class="o">.</span><span class="n">SyncStateReprocessAll</span>
	<span class="p">}</span>

	<span class="c">// If the Service did not change at all, return right away</span>
	<span class="k">if</span> <span class="n">reflect</span><span class="o">.</span><span class="n">DeepEqual</span><span class="p">(</span><span class="n">svcRo</span><span class="p">,</span> <span class="n">svc</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">successRes</span>
	<span class="p">}</span>

	<span class="n">toWrite</span> <span class="o">:=</span> <span class="n">svcRo</span><span class="o">.</span><span class="n">DeepCopy</span><span class="p">()</span>
	<span class="c">// Compare with svcRo's Status once more and carry over any change, since convergeBalancer may have modified svc</span>
	<span class="k">if</span> <span class="o">!</span><span class="n">reflect</span><span class="o">.</span><span class="n">DeepEqual</span><span class="p">(</span><span class="n">svcRo</span><span class="o">.</span><span class="n">Status</span><span class="p">,</span> <span class="n">svc</span><span class="o">.</span><span class="n">Status</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">toWrite</span><span class="o">.</span><span class="n">Status</span> <span class="o">=</span> <span class="n">svc</span><span class="o">.</span><span class="n">Status</span>
	<span class="p">}</span>
	<span class="c">// Likewise for Annotations: carry over any change</span>
	<span class="k">if</span> <span class="o">!</span><span class="n">reflect</span><span class="o">.</span><span class="n">DeepEqual</span><span class="p">(</span><span class="n">svcRo</span><span class="o">.</span><span class="n">Annotations</span><span class="p">,</span> <span class="n">svc</span><span class="o">.</span><span class="n">Annotations</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">toWrite</span><span class="o">.</span><span class="n">Annotations</span> <span class="o">=</span> <span class="n">svc</span><span class="o">.</span><span class="n">Annotations</span>
	<span class="p">}</span>
	<span class="c">// toWrite differs from svcRo only if one of the two fields above changed; only then is an update issued</span>
	<span class="k">if</span> <span class="o">!</span><span class="n">reflect</span><span class="o">.</span><span class="n">DeepEqual</span><span class="p">(</span><span class="n">toWrite</span><span class="p">,</span> <span class="n">svcRo</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">UpdateStatus</span><span class="p">(</span><span class="n">svc</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span> <span class="n">controllers</span><span class="o">.</span><span class="n">SyncStateError</span>
		<span class="p">}</span>
		<span class="k">return</span> <span class="n">successRes</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">successRes</span>
<span class="p">}</span>
</code></pre></div></div>
<p>As the code shows, MetalLB's changes to a Service are confined to its <code class="language-plaintext highlighter-rouge">Status</code> and <code class="language-plaintext highlighter-rouge">Annotations</code> fields. The allocated IP is written to the Service's <code class="language-plaintext highlighter-rouge">Status</code> field, specifically <code class="language-plaintext highlighter-rouge">status.loadBalancer.ingress[].ip</code>, which is exactly the behavior Kubernetes expects.</p>
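<p>The "only Status and Annotations are carried over" step can be isolated in a small sketch. The <code class="language-plaintext highlighter-rouge">Service</code> struct and <code class="language-plaintext highlighter-rouge">diffToWrite</code> below are hypothetical simplifications (the real type is <code class="language-plaintext highlighter-rouge">v1.Service</code>), and <code class="language-plaintext highlighter-rouge">IngressIPs</code> stands in for the ingress list under <code class="language-plaintext highlighter-rouge">Status</code>.</p>

```go
package main

import "reflect"

// Service is a hypothetical, trimmed-down stand-in for v1.Service.
type Service struct {
	Type        string
	Annotations map[string]string
	IngressIPs  []string // stands in for status.loadBalancer.ingress[].ip
}

// diffToWrite copies only IngressIPs and Annotations from the mutated
// copy onto the original and reports whether an update is needed.
// Any other field changed on the mutated copy is deliberately ignored.
func diffToWrite(original, mutated Service) (Service, bool) {
	toWrite := original
	if !reflect.DeepEqual(original.IngressIPs, mutated.IngressIPs) {
		toWrite.IngressIPs = mutated.IngressIPs
	}
	if !reflect.DeepEqual(original.Annotations, mutated.Annotations) {
		toWrite.Annotations = mutated.Annotations
	}
	return toWrite, !reflect.DeepEqual(toWrite, original)
}
```

<p>With this shape, a change to any other field of the working copy never triggers a write; only a difference in the two tracked fields does.</p>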
<h4 id="convergebalancer">convergeBalancer</h4>
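<p>This method is called from <code class="language-plaintext highlighter-rouge">SetBalancer</code>. It is the core of IP address allocation in the Service Controller, and of MetalLB's address allocation as a whole. Because the method is long, the allocation process is explained in segments:</p>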
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// controller/service.go</span>

<span class="c">// #1</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">controller</span><span class="p">)</span> <span class="n">convergeBalancer</span><span class="p">(</span><span class="n">l</span> <span class="n">log</span><span class="o">.</span><span class="n">Logger</span><span class="p">,</span> <span class="n">key</span> <span class="kt">string</span><span class="p">,</span> <span class="n">svc</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Service</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">lbIPs</span> <span class="o">:=</span> <span class="p">[]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">{}</span>
	<span class="k">var</span> <span class="n">err</span> <span class="kt">error</span>

	<span class="c">// Non-LoadBalancer Services return early, and their state is cleared as well</span>
	<span class="k">if</span> <span class="n">svc</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">Type</span> <span class="o">!=</span> <span class="n">v1</span><span class="o">.</span><span class="n">ServiceTypeLoadBalancer</span> <span class="p">{</span>
		<span class="n">c</span><span class="o">.</span><span class="n">clearServiceState</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">svc</span><span class="p">)</span> <span class="o">---&gt;---</span>
		<span class="k">return</span>				      <span class="err">\</span>
	<span class="p">}</span>					       <span class="err">\</span>
                                                        <span class="err">\</span>
                                                        <span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">controller</span><span class="p">)</span> <span class="n">clearServiceState</span><span class="p">(</span><span class="n">key</span> <span class="kt">string</span><span class="p">,</span> <span class="n">svc</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Service</span><span class="p">)</span> <span class="p">{</span>
                                                            <span class="n">c</span><span class="o">.</span><span class="n">ips</span><span class="o">.</span><span class="n">Unassign</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
                                                            <span class="nb">delete</span><span class="p">(</span><span class="n">svc</span><span class="o">.</span><span class="n">Annotations</span><span class="p">,</span> <span class="n">annotationIPAllocateFromPool</span><span class="p">)</span>  <span class="c">// =&gt; "metallb.universe.tf/ip-allocated-from-pool"</span>
                                                            <span class="n">svc</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">LoadBalancer</span> <span class="o">=</span> <span class="n">v1</span><span class="o">.</span><span class="n">LoadBalancerStatus</span><span class="p">{}</span>
                                                          <span class="p">}</span>

	<span class="c">// MetalLB picks the address family based on the ClusterIP, so a Service without a ClusterIP returns right away</span>
	<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">svc</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">ClusterIPs</span><span class="p">)</span> <span class="o">==</span> <span class="m">0</span> <span class="o">&amp;&amp;</span> <span class="n">svc</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">ClusterIP</span> <span class="o">==</span> <span class="s">""</span> <span class="p">{</span>
		<span class="n">c</span><span class="o">.</span><span class="n">clearServiceState</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">svc</span><span class="p">)</span>
		<span class="k">return</span>
	<span class="p">}</span>

	<span class="c">// ...</span>
</code></pre></div></div>
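<p>The code above answers a natural question: why not filter for LoadBalancer Services in <code class="language-plaintext highlighter-rouge">SetBalancer</code> and allocate IPs to them directly? Because that would cover allocation but not reclamation. If only LoadBalancer Services were handled, a Service that used to be of type LoadBalancer and has since changed to another type would never have its LB IP reclaimed, leaving the address uselessly occupied. The filtering therefore happens in this method, which also clears the LB IP of non-LoadBalancer Services so that addresses are reclaimed.</p>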
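<p>The reclamation argument can be made concrete with a toy allocator. Everything here is a hypothetical miniature (the type <code class="language-plaintext highlighter-rouge">allocator</code> and its <code class="language-plaintext highlighter-rouge">converge</code> method are not MetalLB's); it only shows why non-LoadBalancer Services have to pass through the same code path.</p>

```go
package main

// allocator is a hypothetical miniature of the controller's IP bookkeeping.
type allocator struct {
	free     []string
	assigned map[string]string // Service key to IP
}

func newAllocator(ips ...string) *allocator {
	a := new(allocator)
	a.free = append(a.free, ips...)
	a.assigned = map[string]string{}
	return a
}

// unassign returns the Service's IP, if any, to the free list.
func (a *allocator) unassign(key string) {
	if ip, ok := a.assigned[key]; ok {
		delete(a.assigned, key)
		a.free = append(a.free, ip)
	}
}

// converge is called for every Service, whatever its type. Services
// that are not of type LoadBalancer give their IP back; the others
// keep their IP or receive a free one. An empty string means "no IP".
func (a *allocator) converge(key, svcType string) string {
	if svcType != "LoadBalancer" {
		a.unassign(key)
		return ""
	}
	if ip, ok := a.assigned[key]; ok {
		return ip
	}
	if len(a.free) == 0 {
		return ""
	}
	ip := a.free[0]
	a.free = a.free[1:]
	a.assigned[key] = ip
	return ip
}
```

<p>With a single address in the pool, a second LoadBalancer Service stays without an IP until the first one changes type and is converged again; had the first Service been filtered out before reaching <code class="language-plaintext highlighter-rouge">converge</code>, the address would never come back.</p>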

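<p>Note also that MetalLB has no effect on a Headless Service of type LoadBalancer, <strong>which is reasonable</strong>. For a Service without a ClusterIP, the LoadBalancer type is meaningless: the load balancer would not forward traffic to any of the Pods behind the Service. In that situation, an Ingress Gateway can be used instead, mapping each Pod to an Endpoint in order to expose the service externally.</p>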
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// #2</span>
	<span class="c">// Collect every IP address that appears in the Ingress field of Status</span>
	<span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">svc</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">LoadBalancer</span><span class="o">.</span><span class="n">Ingress</span> <span class="p">{</span>
		<span class="n">ip</span> <span class="o">:=</span> <span class="n">svc</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">LoadBalancer</span><span class="o">.</span><span class="n">Ingress</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">IP</span>
		<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">ip</span><span class="p">)</span> <span class="o">!=</span> <span class="m">0</span> <span class="p">{</span>
			<span class="n">lbIPs</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">lbIPs</span><span class="p">,</span> <span class="n">net</span><span class="o">.</span><span class="n">ParseIP</span><span class="p">(</span><span class="n">ip</span><span class="p">))</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="c">// If no IP address is recorded, clear the current Service's state</span>
	<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">lbIPs</span><span class="p">)</span> <span class="o">==</span> <span class="m">0</span> <span class="p">{</span>
		<span class="n">c</span><span class="o">.</span><span class="n">clearServiceState</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">svc</span><span class="p">)</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="c">// Determine the IP family of the current LB IPs</span>
		<span class="n">lbIPsIPFamily</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">ipfamily</span><span class="o">.</span><span class="n">ForAddressesIPs</span><span class="p">(</span><span class="n">lbIPs</span><span class="p">)</span>
		<span class="c">// Determine the IP family of the ClusterIPs</span>
		<span class="n">clusterIPsIPFamily</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">ipfamily</span><span class="o">.</span><span class="n">ForService</span><span class="p">(</span><span class="n">svc</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span>
		<span class="p">}</span>
		<span class="c">// If the LB IP and ClusterIP families differ, the LB IPs are not valid</span>
		<span class="k">if</span> <span class="n">lbIPsIPFamily</span> <span class="o">!=</span> <span class="n">clusterIPsIPFamily</span> <span class="o">||</span> <span class="n">lbIPsIPFamily</span> <span class="o">==</span> <span class="n">ipfamily</span><span class="o">.</span><span class="n">Unknown</span> <span class="p">{</span>
			<span class="n">c</span><span class="o">.</span><span class="n">clearServiceState</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">svc</span><span class="p">)</span>
			<span class="n">lbIPs</span> <span class="o">=</span> <span class="p">[]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">{}</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="c">// ...</span>
</code></pre></div></div>
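<p>Note that when MetalLB processes the IP addresses in <code class="language-plaintext highlighter-rouge">status.loadBalancer.ingress</code>, it does not assume that every recorded address is valid. Since any program or user may have modified the field, MetalLB parses the addresses again to check that they are well-formed. It then requires the LB IPs and the ClusterIPs to belong to the same IP family before the addresses take effect (they are in effect at this point, but not yet checked for validity).</p>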

<p>Both functions used here to obtain the IP family ultimately call <a href="https://github.com/metallb/metallb/blob/4b41fd5175f4a4329f532dda2b456832188d63fc/internal/ipfamily/ipfamily.go#L27">ForAddresses</a>. For a single IP, the family is determined by whether the address is IPv4 or IPv6. For two IPs, dual stack is selected only if the two are of different types; two addresses of the same type produce an error. This also means that MetalLB can assign <strong>at most</strong> two IPs, of different types, to each LoadBalancer Service.</p>
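<p>The family rule just described fits in a few lines. The sketch below is written from that description using only the standard library; the names <code class="language-plaintext highlighter-rouge">Family</code>, <code class="language-plaintext highlighter-rouge">familyOf</code> and <code class="language-plaintext highlighter-rouge">familyOfAll</code> are hypothetical, and treating an unparsable address as an error is an assumption of this sketch, not necessarily what the linked function does.</p>

```go
package main

import (
	"errors"
	"net"
)

// Family is a hypothetical stand-in for the ipfamily package's type.
type Family string

const (
	IPv4      Family = "ipv4"
	IPv6      Family = "ipv6"
	DualStack Family = "dual"
	Unknown   Family = "unknown"
)

// familyOf classifies a single textual address.
func familyOf(addr string) Family {
	ip := net.ParseIP(addr)
	if ip == nil {
		return Unknown // assumption: unparsable input is reported, not guessed
	}
	if ip.To4() != nil {
		return IPv4
	}
	return IPv6
}

// familyOfAll applies the one-or-two-addresses rule described above.
func familyOfAll(addrs []string) (Family, error) {
	switch len(addrs) {
	case 1:
		f := familyOf(addrs[0])
		if f == Unknown {
			return Unknown, errors.New("invalid IP address")
		}
		return f, nil
	case 2:
		a, b := familyOf(addrs[0]), familyOf(addrs[1])
		if a == Unknown || b == Unknown {
			return Unknown, errors.New("invalid IP address")
		}
		if a == b {
			return Unknown, errors.New("two addresses of the same family")
		}
		return DualStack, nil
	default:
		return Unknown, errors.New("expected one or two addresses")
	}
}
```

<p>A pair such as <code class="language-plaintext highlighter-rouge">192.0.2.1</code> and <code class="language-plaintext highlighter-rouge">2001:db8::1</code> resolves to dual stack, whereas two IPv4 addresses, or any list longer than two, is rejected.</p>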
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// #3</span>
	<span class="c">// Existing LB IPs may no longer apply after a configuration change, so check them again and leave room for reallocation</span>
	<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">lbIPs</span><span class="p">)</span> <span class="o">!=</span> <span class="m">0</span> <span class="p">{</span>
		<span class="c">// Address assignment is idempotent; see the next section for details</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">ips</span><span class="o">.</span><span class="n">Assign</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">svc</span><span class="p">,</span> <span class="n">lbIPs</span><span class="p">,</span> <span class="n">k8salloc</span><span class="o">.</span><span class="n">Ports</span><span class="p">(</span><span class="n">svc</span><span class="p">),</span> <span class="n">k8salloc</span><span class="o">.</span><span class="n">SharingKey</span><span class="p">(</span><span class="n">svc</span><span class="p">),</span> <span class="n">k8salloc</span><span class="o">.</span><span class="n">BackendKey</span><span class="p">(</span><span class="n">svc</span><span class="p">));</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">c</span><span class="o">.</span><span class="n">clearServiceState</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">svc</span><span class="p">)</span>
			<span class="n">lbIPs</span> <span class="o">=</span> <span class="p">[]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">{}</span>
		<span class="p">}</span>
		<span class="c">// A changed address-pool annotation means a new pool has to be used for allocation</span>
		<span class="n">desiredPool</span> <span class="o">:=</span> <span class="n">svc</span><span class="o">.</span><span class="n">Annotations</span><span class="p">[</span><span class="n">annotationAddressPool</span><span class="p">]</span>  <span class="c">// =&gt; "metallb.universe.tf/address-pool"</span>
		<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">lbIPs</span><span class="p">)</span> <span class="o">!=</span> <span class="m">0</span> <span class="o">&amp;&amp;</span> <span class="n">desiredPool</span> <span class="o">!=</span> <span class="s">""</span> <span class="o">&amp;&amp;</span> <span class="n">c</span><span class="o">.</span><span class="n">ips</span><span class="o">.</span><span class="n">Pool</span><span class="p">(</span><span class="n">key</span><span class="p">)</span> <span class="o">!=</span> <span class="n">desiredPool</span> <span class="p">{</span>
			<span class="n">c</span><span class="o">.</span><span class="n">clearServiceState</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">svc</span><span class="p">)</span>
			<span class="n">lbIPs</span> <span class="o">=</span> <span class="p">[]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">{}</span>
		<span class="p">}</span>
		<span class="c">// Get the desired LB IPs</span>
		<span class="n">desiredLbIPs</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">getDesiredLbIPs</span><span class="p">(</span><span class="n">svc</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span>
		<span class="p">}</span>
		<span class="c">// If desired LB IPs exist and differ from the current LB IPs, clear the existing state</span>
		<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">desiredLbIPs</span><span class="p">)</span> <span class="o">&gt;</span> <span class="m">0</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">isEqualIPs</span><span class="p">(</span><span class="n">lbIPs</span><span class="p">,</span> <span class="n">desiredLbIPs</span><span class="p">)</span> <span class="p">{</span>
			<span class="n">c</span><span class="o">.</span><span class="n">clearServiceState</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">svc</span><span class="p">)</span>
			<span class="n">lbIPs</span> <span class="o">=</span> <span class="p">[]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">{}</span>
		<span class="p">}</span>
	<span class="p">}</span>

    <span class="c">// ...</span>
</code></pre></div></div>
<p>With well-formedness checked, validity now has to be checked against the configuration. This involves a function that fetches the desired LB IPs: <a href="https://github.com/metallb/metallb/blob/4b41fd5175f4a4329f532dda2b456832188d63fc/controller/service.go#L223">getDesiredLbIPs</a>. It first tries to parse the value stored under <code class="language-plaintext highlighter-rouge">metallb.universe.tf/loadBalancerIPs</code> in the Service's <code class="language-plaintext highlighter-rouge">Annotations</code>, which is a string of IPs joined with <code class="language-plaintext highlighter-rouge">,</code>; if that value is empty, it falls back to the single address in <code class="language-plaintext highlighter-rouge">Service.Spec.LoadBalancerIP</code> as the desired LB IP.</p>

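<p>Why does a desired LB IP exist at all? In most cases the user has no say in which address the load balancer picks, whereas the desired LB IP states the IP the user wants the Service to use. During allocation this LB IP is assigned to the Service directly, provided it is well-formed and valid. Also, an address pool meant to serve user-specified LB IPs typically has AutoAssign disabled in its spec, so that its addresses are not handed out automatically.</p>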
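<p>The lookup order for the desired LB IPs (annotation first, then the spec field) can be sketched as below. This is an illustration written from the description of <code class="language-plaintext highlighter-rouge">getDesiredLbIPs</code>, not a copy of it: the function <code class="language-plaintext highlighter-rouge">desiredIPs</code> and its parameters are hypothetical, and how the real function behaves when both sources are set is not covered here.</p>

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// annotationLoadBalancerIPs is the annotation key named in the text above.
const annotationLoadBalancerIPs = "metallb.universe.tf/loadBalancerIPs"

// desiredIPs returns the IPs the user asked for, or nil if none were
// requested. The annotation takes precedence; specLoadBalancerIP stands
// in for Service.Spec.LoadBalancerIP and is used only as a fallback.
func desiredIPs(annotations map[string]string, specLoadBalancerIP string) ([]net.IP, error) {
	raw := annotations[annotationLoadBalancerIPs]
	if raw == "" {
		raw = specLoadBalancerIP
	}
	if raw == "" {
		return nil, nil
	}
	var ips []net.IP
	for _, field := range strings.Split(raw, ",") {
		ip := net.ParseIP(strings.TrimSpace(field))
		if ip == nil {
			return nil, fmt.Errorf("invalid desired LB IP %q", field)
		}
		ips = append(ips, ip)
	}
	return ips, nil
}
```

<p>An annotation value of <code class="language-plaintext highlighter-rouge">192.0.2.10, 2001:db8::10</code> yields two addresses, while a Service with neither source set yields none and therefore falls through to pool-based allocation.</p>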
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// #4</span>
	<span class="c">// Only Services that still have no LB IP at this point get one allocated; see the next section for details</span>
	<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">lbIPs</span><span class="p">)</span> <span class="o">==</span> <span class="m">0</span> <span class="p">{</span>
		<span class="n">lbIPs</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">allocateIPs</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">svc</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="c">// IP allocation failed</span>
	<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">lbIPs</span><span class="p">)</span> <span class="o">==</span> <span class="m">0</span> <span class="p">{</span>
		<span class="n">c</span><span class="o">.</span><span class="n">clearServiceState</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">svc</span><span class="p">)</span>
		<span class="k">return</span>
	<span class="p">}</span>

	<span class="c">// Check that the address pool the IP was allocated from exists</span>
	<span class="n">pool</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">ips</span><span class="o">.</span><span class="n">Pool</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">pool</span> <span class="o">==</span> <span class="s">""</span> <span class="o">||</span> <span class="n">c</span><span class="o">.</span><span class="n">pools</span> <span class="o">==</span> <span class="no">nil</span> <span class="o">||</span> <span class="n">c</span><span class="o">.</span><span class="n">pools</span><span class="o">.</span><span class="n">IsEmpty</span><span class="p">(</span><span class="n">pool</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">c</span><span class="o">.</span><span class="n">clearServiceState</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">svc</span><span class="p">)</span>
		<span class="k">return</span>
	<span class="p">}</span>

	<span class="c">// Finally, record the allocated IPs in the Service's Status and Annotations fields</span>
	<span class="n">lbIngressIPs</span> <span class="o">:=</span> <span class="p">[]</span><span class="n">v1</span><span class="o">.</span><span class="n">LoadBalancerIngress</span><span class="p">{}</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">lbIP</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">lbIPs</span> <span class="p">{</span>
		<span class="n">lbIngressIPs</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">lbIngressIPs</span><span class="p">,</span> <span class="n">v1</span><span class="o">.</span><span class="n">LoadBalancerIngress</span><span class="p">{</span><span class="n">IP</span><span class="o">:</span> <span class="n">lbIP</span><span class="o">.</span><span class="n">String</span><span class="p">()})</span>
	<span class="p">}</span>
	<span class="n">svc</span><span class="o">.</span><span class="n">Status</span><span class="o">.</span><span class="n">LoadBalancer</span><span class="o">.</span><span class="n">Ingress</span> <span class="o">=</span> <span class="n">lbIngressIPs</span>
	<span class="k">if</span> <span class="n">svc</span><span class="o">.</span><span class="n">Annotations</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="n">svc</span><span class="o">.</span><span class="n">Annotations</span> <span class="o">=</span> <span class="nb">make</span><span class="p">(</span><span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">string</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="n">svc</span><span class="o">.</span><span class="n">Annotations</span><span class="p">[</span><span class="n">annotationIPAllocateFromPool</span><span class="p">]</span> <span class="o">=</span> <span class="n">pool</span>  <span class="c">// =&gt; "metallb.universe.tf/ip-allocated-from-pool"</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Finally, a Service without an LB IP is allocated an address, which is saved in its <code class="language-plaintext highlighter-rouge">Status</code> and <code class="language-plaintext highlighter-rouge">Annotations</code> fields. Allocation uses the Service Controller's <code class="language-plaintext highlighter-rouge">allocateIPs</code> method, which proceeds in priority order: first the desired LB IPs, then the designated address pool, and finally all relevant address pools.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// controller/service.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">controller</span><span class="p">)</span> <span class="n">allocateIPs</span><span class="p">(</span><span class="n">key</span> <span class="kt">string</span><span class="p">,</span> <span class="n">svc</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Service</span><span class="p">)</span> <span class="p">([]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="c">// Determine the IP family used by the Service, as described above</span>
	<span class="n">serviceIPFamily</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">ipfamily</span><span class="o">.</span><span class="n">ForService</span><span class="p">(</span><span class="n">svc</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
	<span class="p">}</span>
	<span class="n">desiredLbIPs</span><span class="p">,</span> <span class="n">desiredLbIPFamily</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">getDesiredLbIPs</span><span class="p">(</span><span class="n">svc</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
	<span class="p">}</span>

	<span class="c">// If the user specified desired LB IPs, try to assign those first</span>
	<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">desiredLbIPs</span><span class="p">)</span> <span class="o">&gt;</span> <span class="m">0</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">serviceIPFamily</span> <span class="o">!=</span> <span class="n">desiredLbIPFamily</span> <span class="p">{</span>
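			<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">errors</span><span class="o">.</span><span class="n">New</span><span class="p">(</span><span class="s">"desired LB IP family does not match the Service IP family"</span><span class="p">)</span> <span class="c">// illustrative message; the original error text is elided in this walkthrough</span>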
		<span class="p">}</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">ips</span><span class="o">.</span><span class="n">Assign</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">svc</span><span class="p">,</span> <span class="n">desiredLbIPs</span><span class="p">,</span> <span class="n">k8salloc</span><span class="o">.</span><span class="n">Ports</span><span class="p">(</span><span class="n">svc</span><span class="p">),</span> <span class="n">k8salloc</span><span class="o">.</span><span class="n">SharingKey</span><span class="p">(</span><span class="n">svc</span><span class="p">),</span> <span class="n">k8salloc</span><span class="o">.</span><span class="n">BackendKey</span><span class="p">(</span><span class="n">svc</span><span class="p">));</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
		<span class="p">}</span>
		<span class="k">return</span> <span class="n">desiredLbIPs</span><span class="p">,</span> <span class="no">nil</span>
	<span class="p">}</span>
	<span class="c">// Otherwise, allocate an IP from the address pool named in the annotation</span>
	<span class="n">desiredPool</span> <span class="o">:=</span> <span class="n">svc</span><span class="o">.</span><span class="n">Annotations</span><span class="p">[</span><span class="n">annotationAddressPool</span><span class="p">]</span>
	<span class="k">if</span> <span class="n">desiredPool</span> <span class="o">!=</span> <span class="s">""</span> <span class="p">{</span>
		<span class="n">ips</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">ips</span><span class="o">.</span><span class="n">AllocateFromPool</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">svc</span><span class="p">,</span> <span class="n">serviceIPFamily</span><span class="p">,</span> <span class="n">desiredPool</span><span class="p">,</span> <span class="n">k8salloc</span><span class="o">.</span><span class="n">Ports</span><span class="p">(</span><span class="n">svc</span><span class="p">),</span> <span class="n">k8salloc</span><span class="o">.</span><span class="n">SharingKey</span><span class="p">(</span><span class="n">svc</span><span class="p">),</span> <span class="n">k8salloc</span><span class="o">.</span><span class="n">BackendKey</span><span class="p">(</span><span class="n">svc</span><span class="p">))</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
		<span class="p">}</span>
		<span class="k">return</span> <span class="n">ips</span><span class="p">,</span> <span class="no">nil</span>
	<span class="p">}</span>

	<span class="c">// If no address pool was specified, allocate from any pool that applies to this Service</span>
	<span class="k">return</span> <span class="n">c</span><span class="o">.</span><span class="n">ips</span><span class="o">.</span><span class="n">Allocate</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">svc</span><span class="p">,</span> <span class="n">serviceIPFamily</span><span class="p">,</span> <span class="n">k8salloc</span><span class="o">.</span><span class="n">Ports</span><span class="p">(</span><span class="n">svc</span><span class="p">),</span> <span class="n">k8salloc</span><span class="o">.</span><span class="n">SharingKey</span><span class="p">(</span><span class="n">svc</span><span class="p">),</span> <span class="n">k8salloc</span><span class="o">.</span><span class="n">BackendKey</span><span class="p">(</span><span class="n">svc</span><span class="p">))</span>
<span class="p">}</span>
</code></pre></div></div>
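<p>The order of attempts in the function above can be condensed into a small sketch. The <code class="language-plaintext highlighter-rouge">request</code> type and the <code class="language-plaintext highlighter-rouge">chooseStrategy</code> function are hypothetical names introduced only for illustration and are not part of MetalLB; the sketch only reports which Allocator method would be called.</p>

```go
package main

import "errors"

// request is a hypothetical, simplified view of what the controller reads
// from a Service before asking the Allocator for addresses.
type request struct {
	desiredIPs    []string // IPs explicitly requested by the user
	desiredFamily string   // address family of desiredIPs
	serviceFamily string   // address family of the Service itself
	desiredPool   string   // value of the address-pool annotation, may be empty
}

// chooseStrategy mirrors the order of attempts described above and returns
// the name of the Allocator method that would be called.
func chooseStrategy(r request) (string, error) {
	// 1. Explicitly requested IPs win, as long as the families agree.
	if len(r.desiredIPs) > 0 {
		if r.serviceFamily != r.desiredFamily {
			return "", errors.New("requested IPs do not match the service IP family")
		}
		return "Assign", nil
	}
	// 2. A pool named in the annotation restricts allocation to that pool.
	if r.desiredPool != "" {
		return "AllocateFromPool", nil
	}
	// 3. Otherwise any applicable pool may be used.
	return "Allocate", nil
}
```

<p>For instance, a request that carries neither desired IPs nor a pool annotation ends up in the third branch.</p>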
<h3 id="核心结构allocator">Core structure: Allocator</h3>
<p>As mentioned above, every operation that assigns or reclaims an IP address goes through methods provided by the Allocator, such as <code class="language-plaintext highlighter-rouge">Unassign</code>, <code class="language-plaintext highlighter-rouge">Assign</code> and <code class="language-plaintext highlighter-rouge">Allocate</code>.</p>

<p>The Allocator appears as a field of the Service Controller. It is a data structure that records the mappings between IPs and various pieces of Service information.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">controller</span> <span class="k">struct</span> <span class="p">{</span>
	<span class="n">client</span> <span class="n">service</span>
	<span class="n">pools</span>  <span class="o">*</span><span class="n">config</span><span class="o">.</span><span class="n">Pools</span>
	<span class="n">ips</span>    <span class="o">*</span><span class="n">allocator</span><span class="o">.</span><span class="n">Allocator</span>
<span class="p">}</span>

<span class="c">// internal/allocator/allocator.go</span>
<span class="k">type</span> <span class="n">Allocator</span> <span class="k">struct</span> <span class="p">{</span>
	<span class="n">pools</span> <span class="o">*</span><span class="n">config</span><span class="o">.</span><span class="n">Pools</span>

	<span class="n">allocated</span>       <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="o">*</span><span class="n">alloc</span>          <span class="c">// svc -&gt; alloc, the IPs already allocated to each Service</span>
	<span class="n">sharingKeyForIP</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="o">*</span><span class="n">key</span>            <span class="c">// ip.String() -&gt; assigned sharing key</span>
	<span class="n">portsInUse</span>      <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="k">map</span><span class="p">[</span><span class="n">Port</span><span class="p">]</span><span class="kt">string</span> <span class="c">// ip.String() -&gt; Port -&gt; svc</span>
	<span class="n">servicesOnIP</span>    <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">bool</span> <span class="c">// ip.String() -&gt; svc -&gt; allocated?</span>
	<span class="n">poolIPsInUse</span>    <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">int</span>  <span class="c">// poolName -&gt; ip.String() -&gt; number of users</span>
<span class="p">}</span>

<span class="k">type</span> <span class="n">alloc</span> <span class="k">struct</span> <span class="p">{</span>
	<span class="n">pool</span>  <span class="kt">string</span>
	<span class="n">ips</span>   <span class="p">[]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span>
	<span class="n">ports</span> <span class="p">[]</span><span class="n">Port</span>
	<span class="n">key</span>   <span class="c">// the key struct defined below</span>
<span class="p">}</span>

<span class="k">type</span> <span class="n">key</span> <span class="k">struct</span> <span class="p">{</span>
	<span class="n">sharing</span> <span class="kt">string</span>
	<span class="n">backend</span> <span class="kt">string</span>
<span class="p">}</span>
</code></pre></div></div>
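<h4 id="多租户地址池与-ip-生成">Multi-tenant address pools and IP generation</h4>
<p>The <code class="language-plaintext highlighter-rouge">Allocate</code> method handles the case where no address pool is specified. It first consults the <code class="language-plaintext highlighter-rouge">IPAddressPool.spec.serviceAllocation</code> field, which was introduced to give address pools multi-tenancy. <a href="https://metallb.universe.tf/configuration/_advanced_ipaddresspool_configuration/#reduce-scope-of-address-allocation-to-specific-namespace-and-service">It covers</a> the pool's priority (the lower the value, the higher the priority), target namespaces, <a href="https://github.com/metallb/metallb/issues/383">namespace selectors</a> and Service selectors, which together define the scope in which a pool applies. Only when allocation from these tenant pools fails does it fall back to the global, non-tenant pools.</p>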
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/allocator/allocator.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">Allocator</span><span class="p">)</span> <span class="n">Allocate</span><span class="p">(</span><span class="n">svcKey</span> <span class="kt">string</span><span class="p">,</span> <span class="n">svc</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Service</span><span class="p">,</span> <span class="n">serviceIPFamily</span> <span class="n">ipfamily</span><span class="o">.</span><span class="n">Family</span><span class="p">,</span> <span class="n">ports</span> <span class="p">[]</span><span class="n">Port</span><span class="p">,</span> <span class="n">sharingKey</span><span class="p">,</span> <span class="n">backendKey</span> <span class="kt">string</span><span class="p">)</span> <span class="p">([]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="c">// For a Service that already has addresses, try assigning them again</span>
	<span class="k">if</span> <span class="n">alloc</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">allocated</span><span class="p">[</span><span class="n">svcKey</span><span class="p">];</span> <span class="n">alloc</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="c">// The same addresses are assigned again, mainly to re-validate them; if validation passes, Allocator.allocated is rewritten but its content stays the same</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">Assign</span><span class="p">(</span><span class="n">svcKey</span><span class="p">,</span> <span class="n">svc</span><span class="p">,</span> <span class="n">alloc</span><span class="o">.</span><span class="n">ips</span><span class="p">,</span> <span class="n">ports</span><span class="p">,</span> <span class="n">sharingKey</span><span class="p">,</span> <span class="n">backendKey</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
		<span class="p">}</span>
		<span class="k">return</span> <span class="n">alloc</span><span class="o">.</span><span class="n">ips</span><span class="p">,</span> <span class="no">nil</span>
	<span class="p">}</span>
	<span class="c">// Collect the pools whose serviceAllocation matches this Service's metadata or namespace, ordered from highest to lowest priority</span>
	<span class="n">pinnedPools</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">pinnedPoolsForService</span><span class="p">(</span><span class="n">svc</span><span class="p">)</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">pool</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">pinnedPools</span> <span class="p">{</span>
		<span class="c">// As soon as one pool yields IPs, return them</span>
		<span class="k">if</span> <span class="n">ips</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">AllocateFromPool</span><span class="p">(</span><span class="n">svcKey</span><span class="p">,</span> <span class="n">svc</span><span class="p">,</span> <span class="n">serviceIPFamily</span><span class="p">,</span> <span class="n">pool</span><span class="o">.</span><span class="n">Name</span><span class="p">,</span> <span class="n">ports</span><span class="p">,</span> <span class="n">sharingKey</span><span class="p">,</span> <span class="n">backendKey</span><span class="p">);</span> <span class="n">err</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span> <span class="n">ips</span><span class="p">,</span> <span class="no">nil</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="c">// Walk all pools, skipping tenant pools (those with a serviceAllocation) and pools that do not auto-assign IPs</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">pool</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">a</span><span class="o">.</span><span class="n">pools</span><span class="o">.</span><span class="n">ByName</span> <span class="p">{</span>
		<span class="k">if</span> <span class="o">!</span><span class="n">pool</span><span class="o">.</span><span class="n">AutoAssign</span> <span class="o">||</span> <span class="n">pool</span><span class="o">.</span><span class="n">ServiceAllocations</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="k">if</span> <span class="n">ips</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">AllocateFromPool</span><span class="p">(</span><span class="n">svcKey</span><span class="p">,</span> <span class="n">svc</span><span class="p">,</span> <span class="n">serviceIPFamily</span><span class="p">,</span> <span class="n">pool</span><span class="o">.</span><span class="n">Name</span><span class="p">,</span> <span class="n">ports</span><span class="p">,</span> <span class="n">sharingKey</span><span class="p">,</span> <span class="n">backendKey</span><span class="p">);</span> <span class="n">err</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span> <span class="n">ips</span><span class="p">,</span> <span class="no">nil</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">errors</span><span class="o">.</span><span class="n">New</span><span class="p">(</span><span class="s">"no available IPs"</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
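<p>The pool selection order in <code class="language-plaintext highlighter-rouge">Allocate</code> can be sketched as follows. This is a simplified model under stated assumptions, not MetalLB's implementation: the <code class="language-plaintext highlighter-rouge">pool</code> type, its fields and <code class="language-plaintext highlighter-rouge">pickPool</code> are hypothetical, a pool "succeeds" whenever it still has a free address, and only the "lower value means higher priority" rule from the text is modelled.</p>

```go
package main

import (
	"errors"
	"sort"
)

// pool is a hypothetical, trimmed-down address pool.
type pool struct {
	name       string
	autoAssign bool
	tenant     bool // true when the pool carries a serviceAllocation
	matches    bool // true when that serviceAllocation selects the Service
	priority   int  // lower value means higher priority
	free       int  // number of addresses still available
}

// pickPool returns the name of the pool an address would come from.
func pickPool(pools []pool) (string, error) {
	// Step 1: tenant pools that select the Service, highest priority first.
	var pinned []pool
	for _, p := range pools {
		if p.tenant && p.matches {
			pinned = append(pinned, p)
		}
	}
	sort.SliceStable(pinned, func(i, j int) bool {
		return pinned[i].priority < pinned[j].priority
	})
	for _, p := range pinned {
		if p.free > 0 {
			return p.name, nil
		}
	}
	// Step 2: fall back to global (non-tenant) pools that auto-assign.
	for _, p := range pools {
		if !p.autoAssign || p.tenant {
			continue
		}
		if p.free > 0 {
			return p.name, nil
		}
	}
	return "", errors.New("no available IPs")
}
```

<p>With an exhausted high-priority tenant pool, the next tenant pool is tried before any global pool is considered.</p>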
<p>To obtain IPs from a given address pool, MetalLB walks each CIDR of the pool until every required IP family has been given an address, and then assigns the chosen IPs to the current Service. Picking an IP from a single CIDR is the job of the <a href="https://github.com/metallb/metallb/blob/v0.13.9/internal/allocator/allocator.go#L468">getIPFromCIDR</a> method, which essentially calls into the <a href="https://github.com/mikioh/ipaddr">ipaddr</a> library; MetalLB uses that library to track IP address allocation. The method also skips addresses ruled out by <a href="https://metallb.universe.tf/usage/#ip-address-sharing">IP address sharing</a> and addresses that are problematic on <a href="https://metallb.universe.tf/configuration/_advanced_ipaddresspool_configuration/#handling-buggy-networks">buggy networks</a>.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">Allocator</span><span class="p">)</span> <span class="n">AllocateFromPool</span><span class="p">(</span><span class="n">svcKey</span> <span class="kt">string</span><span class="p">,</span> <span class="n">svc</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Service</span><span class="p">,</span> <span class="n">serviceIPFamily</span> <span class="n">ipfamily</span><span class="o">.</span><span class="n">Family</span><span class="p">,</span> <span class="n">poolName</span> <span class="kt">string</span><span class="p">,</span> <span class="n">ports</span> <span class="p">[]</span><span class="n">Port</span><span class="p">,</span> <span class="n">sharingKey</span><span class="p">,</span> <span class="n">backendKey</span> <span class="kt">string</span><span class="p">)</span> <span class="p">([]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">alloc</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">allocated</span><span class="p">[</span><span class="n">svcKey</span><span class="p">];</span> <span class="n">alloc</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="c">// ...</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">Assign</span><span class="p">(</span><span class="n">svcKey</span><span class="p">,</span> <span class="n">svc</span><span class="p">,</span> <span class="n">alloc</span><span class="o">.</span><span class="n">ips</span><span class="p">,</span> <span class="n">ports</span><span class="p">,</span> <span class="n">sharingKey</span><span class="p">,</span> <span class="n">backendKey</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
		<span class="p">}</span>
		<span class="k">return</span> <span class="n">alloc</span><span class="o">.</span><span class="n">ips</span><span class="p">,</span> <span class="no">nil</span>
	<span class="p">}</span>

	<span class="c">// Look up the requested address pool</span>
	<span class="n">pool</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">pools</span><span class="o">.</span><span class="n">ByName</span><span class="p">[</span><span class="n">poolName</span><span class="p">]</span>
	<span class="n">ips</span> <span class="o">:=</span> <span class="p">[]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">{}</span>
	<span class="c">// Decide which address types to allocate based on the IP family</span>
	<span class="n">ipfamilySel</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">(</span><span class="k">map</span><span class="p">[</span><span class="n">ipfamily</span><span class="o">.</span><span class="n">Family</span><span class="p">]</span><span class="kt">bool</span><span class="p">)</span>
	<span class="k">switch</span> <span class="n">serviceIPFamily</span> <span class="p">{</span>
	<span class="k">case</span> <span class="n">ipfamily</span><span class="o">.</span><span class="n">DualStack</span><span class="o">:</span>
		<span class="n">ipfamilySel</span><span class="p">[</span><span class="n">ipfamily</span><span class="o">.</span><span class="n">IPv4</span><span class="p">],</span> <span class="n">ipfamilySel</span><span class="p">[</span><span class="n">ipfamily</span><span class="o">.</span><span class="n">IPv6</span><span class="p">]</span> <span class="o">=</span> <span class="no">true</span><span class="p">,</span> <span class="no">true</span>
	<span class="k">default</span><span class="o">:</span>
		<span class="n">ipfamilySel</span><span class="p">[</span><span class="n">serviceIPFamily</span><span class="p">]</span> <span class="o">=</span> <span class="no">true</span>
	<span class="p">}</span>

	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">cidr</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">pool</span><span class="o">.</span><span class="n">CIDR</span> <span class="p">{</span>
		<span class="c">// A CIDR can only be used when its IP family is one that still needs an address</span>
		<span class="n">cidrIPFamily</span> <span class="o">:=</span> <span class="n">ipfamily</span><span class="o">.</span><span class="n">ForCIDR</span><span class="p">(</span><span class="n">cidr</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">ipfamilySel</span><span class="p">[</span><span class="n">cidrIPFamily</span><span class="p">];</span> <span class="o">!</span><span class="n">ok</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="n">ip</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">getIPFromCIDR</span><span class="p">(</span><span class="n">cidr</span><span class="p">,</span> <span class="n">pool</span><span class="o">.</span><span class="n">AvoidBuggyIPs</span><span class="p">,</span> <span class="n">svcKey</span><span class="p">,</span> <span class="n">ports</span><span class="p">,</span> <span class="n">sharingKey</span><span class="p">,</span> <span class="n">backendKey</span><span class="p">)</span>  <span class="c">// pick an IP</span>
		<span class="k">if</span> <span class="n">ip</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">ips</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">ips</span><span class="p">,</span> <span class="n">ip</span><span class="p">)</span>
			<span class="nb">delete</span><span class="p">(</span><span class="n">ipfamilySel</span><span class="p">,</span> <span class="n">cidrIPFamily</span><span class="p">)</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="c">// Any IP family left without an address means the pool is exhausted</span>
	<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">ipfamilySel</span><span class="p">)</span> <span class="o">&gt;</span> <span class="m">0</span> <span class="p">{</span>
		<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="c">// err</span>
	<span class="p">}</span>
	<span class="n">err</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">Assign</span><span class="p">(</span><span class="n">svcKey</span><span class="p">,</span> <span class="n">svc</span><span class="p">,</span> <span class="n">ips</span><span class="p">,</span> <span class="n">ports</span><span class="p">,</span> <span class="n">sharingKey</span><span class="p">,</span> <span class="n">backendKey</span><span class="p">)</span>  <span class="c">// assign the allocated IPs to the Service</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">ips</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
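<p>To make the "pick an IP from a CIDR" step more concrete, here is a minimal sketch that uses the Go standard library <code class="language-plaintext highlighter-rouge">net/netip</code> instead of the ipaddr library MetalLB actually relies on. The helper <code class="language-plaintext highlighter-rouge">firstFreeIP</code> and the <code class="language-plaintext highlighter-rouge">used</code> set are hypothetical; a plain "already used" lookup stands in for the real sharing check, and "buggy" addresses are assumed to be IPv4 addresses ending in .0 or .255, as described in the MetalLB documentation linked above.</p>

```go
package main

import "net/netip"

// firstFreeIP walks a CIDR in ascending order and returns the first address
// that is not marked as used. It is a hypothetical helper, not MetalLB code.
func firstFreeIP(cidr string, avoidBuggyIPs bool, used map[string]bool) (netip.Addr, bool) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return netip.Addr{}, false
	}
	prefix = prefix.Masked() // start from the first address of the range
	for ip := prefix.Addr(); prefix.Contains(ip); ip = ip.Next() {
		if avoidBuggyIPs && ip.Is4() {
			b := ip.As4()
			if b[3] == 0 || b[3] == 255 {
				continue // skip .0 and .255, which some devices mishandle
			}
		}
		if used[ip.String()] {
			continue // stand-in for the real sharing check
		}
		return ip, true
	}
	return netip.Addr{}, false
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">firstFreeIP("192.168.10.0/30", true, used)</code> with 192.168.10.1 already marked as used skips both .0 and .1 and moves on to the next address in the range.</p>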
<p>Once the IPs have been picked, they are bound to the Service through the <code class="language-plaintext highlighter-rouge">Assign</code> method. It first validates the address pool and the IPs (including whether a shared IP is usable), then calls the <code class="language-plaintext highlighter-rouge">assign</code> method to update the fields of the <code class="language-plaintext highlighter-rouge">Allocator</code> struct, for example <code class="language-plaintext highlighter-rouge">a.allocated[svc] = alloc</code>.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">Allocator</span><span class="p">)</span> <span class="n">Assign</span><span class="p">(</span><span class="n">svcKey</span> <span class="kt">string</span><span class="p">,</span> <span class="n">svc</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Service</span><span class="p">,</span> <span class="n">ips</span> <span class="p">[]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">,</span> <span class="n">ports</span> <span class="p">[]</span><span class="n">Port</span><span class="p">,</span> <span class="n">sharingKey</span><span class="p">,</span> <span class="n">backendKey</span> <span class="kt">string</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>

	<span class="c">// check ...</span>

	<span class="n">alloc</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">alloc</span><span class="p">{</span>
		<span class="n">pool</span><span class="o">:</span>  <span class="n">pool</span><span class="o">.</span><span class="n">Name</span><span class="p">,</span>
		<span class="n">ips</span><span class="o">:</span>   <span class="n">ips</span><span class="p">,</span>
		<span class="n">ports</span><span class="o">:</span> <span class="nb">make</span><span class="p">([]</span><span class="n">Port</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">ports</span><span class="p">)),</span>
		<span class="n">key</span><span class="o">:</span>   <span class="o">*</span><span class="n">sk</span><span class="p">,</span>
	<span class="p">}</span>
	<span class="nb">copy</span><span class="p">(</span><span class="n">alloc</span><span class="o">.</span><span class="n">ports</span><span class="p">,</span> <span class="n">ports</span><span class="p">)</span>
	<span class="n">a</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">svcKey</span><span class="p">,</span> <span class="n">alloc</span><span class="p">)</span>
	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Likewise, the <code class="language-plaintext highlighter-rouge">Unassign</code> method reclaims IPs. Its main job is to remove everything related to the current Service from the fields of the <code class="language-plaintext highlighter-rouge">Allocator</code> struct, for example <code class="language-plaintext highlighter-rouge">delete(a.allocated, svc)</code>.</p>
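<p>The bookkeeping behind assigning and unassigning can be illustrated with a reduced model. The <code class="language-plaintext highlighter-rouge">book</code> type below is hypothetical and keeps only three of the Allocator's maps, with IPs stored as strings; dropping stale state before re-assigning is an assumption of this sketch rather than a statement about MetalLB's code.</p>

```go
package main

// portKey identifies a protocol/port pair, for example {"TCP", 53}.
type portKey struct {
	proto string
	num   int
}

// book is a hypothetical, reduced version of the Allocator bookkeeping.
type book struct {
	allocated    map[string][]string           // svc -> IPs
	portsInUse   map[string]map[portKey]string // ip -> port -> svc
	servicesOnIP map[string]map[string]bool    // ip -> set of svcs
}

func newBook() *book {
	return &book{
		allocated:    map[string][]string{},
		portsInUse:   map[string]map[portKey]string{},
		servicesOnIP: map[string]map[string]bool{},
	}
}

// assign records that svc now uses the given ports on each of the given IPs.
func (b *book) assign(svc string, ips []string, ports []portKey) {
	b.unassign(svc) // assumption of this sketch: drop stale state first
	b.allocated[svc] = ips
	for _, ip := range ips {
		if b.portsInUse[ip] == nil {
			b.portsInUse[ip] = map[portKey]string{}
		}
		if b.servicesOnIP[ip] == nil {
			b.servicesOnIP[ip] = map[string]bool{}
		}
		for _, p := range ports {
			b.portsInUse[ip][p] = svc
		}
		b.servicesOnIP[ip][svc] = true
	}
}

// unassign removes every trace of svc and forgets IPs nobody else uses.
func (b *book) unassign(svc string) {
	for _, ip := range b.allocated[svc] {
		for p, owner := range b.portsInUse[ip] {
			if owner == svc {
				delete(b.portsInUse[ip], p)
			}
		}
		delete(b.servicesOnIP[ip], svc)
		if len(b.servicesOnIP[ip]) == 0 {
			delete(b.servicesOnIP, ip)
			delete(b.portsInUse, ip)
		}
	}
	delete(b.allocated, svc)
}
```

<p>When two Services share one IP, unassigning the first leaves the second one's ports in place, and the IP is only forgotten once the last Service is gone.</p>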
<h4 id="ip-地址共享机制">IP address sharing mechanism</h4>
<p>The analysis above left out <a href="https://metallb.universe.tf/usage/#ip-address-sharing">IP address sharing</a>. MetalLB introduced this feature for two main purposes:</p>

<ul>
  <li>Work around the K8s limitation that a LoadBalancer Service cannot run multiple protocols on the same port</li>
  <li>Cope with a shortage of IP addresses when there are more Services than available IPs</li>
</ul>

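<p>The first point is very useful for a DNS service, which has to listen on both TCP and UDP. Because of the K8s limitation, such a LoadBalancer Service cannot be created directly. With MetalLB, the problem can be solved by creating two Services that have the same sharing-key and the same <code class="language-plaintext highlighter-rouge">spec.loadBalancerIP</code>, each of them targeting the same pods.</p>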

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-06/metallb-ip-sharing.png" alt="metallb-ip-sharing" /></p>

<p>Two Services that share an IP address must also meet a few conditions:</p>
<ol>
  <li>They have the same sharing-key <code class="language-plaintext highlighter-rouge">Annotation</code></li>
  <li>They do not use the same protocol on the same port</li>
  <li>They both use the <code class="language-plaintext highlighter-rouge">Cluster</code> External TrafficPolicy, or they both point to the same set of pods</li>
</ol>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/allocator/allocator.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">Allocator</span><span class="p">)</span> <span class="n">checkSharing</span><span class="p">(</span><span class="n">svc</span> <span class="kt">string</span><span class="p">,</span> <span class="n">ip</span> <span class="kt">string</span><span class="p">,</span> <span class="n">ports</span> <span class="p">[]</span><span class="n">Port</span><span class="p">,</span> <span class="n">sk</span> <span class="o">*</span><span class="n">key</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">existingSK</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">sharingKeyForIP</span><span class="p">[</span><span class="n">ip</span><span class="p">];</span> <span class="n">existingSK</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="c">// Check whether the sharing keys match</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">sharingOK</span><span class="p">(</span><span class="n">existingSK</span><span class="p">,</span> <span class="n">sk</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="c">// ...</span>
		<span class="p">}</span>

		<span class="c">// Check whether a port is already taken; a port is made up of a protocol and a port number</span>
		<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">port</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">ports</span> <span class="p">{</span>
			<span class="k">if</span> <span class="n">curSvc</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">portsInUse</span><span class="p">[</span><span class="n">ip</span><span class="p">][</span><span class="n">port</span><span class="p">];</span> <span class="n">ok</span> <span class="o">&amp;&amp;</span> <span class="n">curSvc</span> <span class="o">!=</span> <span class="n">svc</span> <span class="p">{</span>
				<span class="k">return</span> <span class="c">// err</span>
			<span class="p">}</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
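<p>The three sharing conditions can also be written as a standalone check. This is a sketch under assumptions: <code class="language-plaintext highlighter-rouge">service</code>, <code class="language-plaintext highlighter-rouge">protoPort</code> and <code class="language-plaintext highlighter-rouge">canShare</code> are hypothetical names, the backend key is treated as an opaque string that is equal when both Services use the <code class="language-plaintext highlighter-rouge">Cluster</code> policy or select the same pods, and an empty sharing key is assumed to mean "sharing not allowed".</p>

```go
package main

import (
	"errors"
	"fmt"
)

// protoPort is a protocol/port pair such as {"UDP", 53}.
type protoPort struct {
	proto string
	num   int
}

// service is a hypothetical summary of one Service that wants a given IP.
type service struct {
	name       string
	sharingKey string // value of the sharing-key annotation
	backendKey string // assumed equal for Cluster policy or identical pod selection
	ports      []protoPort
}

// canShare reports whether candidate may use an IP already held by existing.
func canShare(existing, candidate service) error {
	// Condition 1: both Services carry the same, non-empty sharing key.
	if existing.sharingKey == "" || candidate.sharingKey == "" {
		return errors.New("both services need a sharing key")
	}
	if existing.sharingKey != candidate.sharingKey {
		return errors.New("sharing keys differ")
	}
	// Condition 3: both Services resolve to the same backends.
	if existing.backendKey != candidate.backendKey {
		return errors.New("backends differ")
	}
	// Condition 2: no protocol/port pair may be claimed twice.
	inUse := make(map[protoPort]bool)
	for _, p := range existing.ports {
		inUse[p] = true
	}
	for _, p := range candidate.ports {
		if inUse[p] {
			return fmt.Errorf("port %s/%d is already in use by %s", p.proto, p.num, existing.name)
		}
	}
	return nil
}
```

<p>A TCP/53 Service and a UDP/53 Service with the same sharing key pass this check, while a second TCP/53 Service is rejected.</p>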
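<h2 id="外部广播">External announcement</h2>
<p>Once MetalLB has assigned an IP (the External IP) to a Service, it still has to make the network outside the cluster aware of that IP, that is, announce the IP externally. MetalLB does this with standard protocols (ARP, NDP and BGP) and offers two working modes for it: Layer 2 and BGP.</p>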

<p>Both working modes are enabled by default, and each has its own controller implementation.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// speaker/main.go</span>

<span class="k">func</span> <span class="n">newController</span><span class="p">(</span><span class="n">cfg</span> <span class="n">controllerConfig</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">controller</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">handlers</span> <span class="o">:=</span> <span class="k">map</span><span class="p">[</span><span class="n">config</span><span class="o">.</span><span class="n">Proto</span><span class="p">]</span><span class="n">Protocol</span><span class="p">{</span>
		<span class="n">config</span><span class="o">.</span><span class="n">BGP</span><span class="o">:</span> <span class="o">&amp;</span><span class="n">bgpController</span><span class="p">{</span><span class="c">/*...*/</span><span class="p">},</span>
	<span class="p">}</span>
	<span class="n">protocols</span> <span class="o">:=</span> <span class="p">[]</span><span class="n">config</span><span class="o">.</span><span class="n">Proto</span><span class="p">{</span><span class="n">config</span><span class="o">.</span><span class="n">BGP</span><span class="p">}</span>

	<span class="k">if</span> <span class="o">!</span><span class="n">cfg</span><span class="o">.</span><span class="n">DisableLayer2</span> <span class="p">{</span>  <span class="c">// There is a switch for Layer2 mode, but no way to set it was found in the implementation</span>
		<span class="n">a</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">layer2</span><span class="o">.</span><span class="n">New</span><span class="p">(</span><span class="n">cfg</span><span class="o">.</span><span class="n">Logger</span><span class="p">)</span>  <span class="c">// initialize the Layer2 announcer</span>
		<span class="n">handlers</span><span class="p">[</span><span class="n">config</span><span class="o">.</span><span class="n">Layer2</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">layer2Controller</span><span class="p">{</span><span class="c">/*...*/</span><span class="p">}</span>
		<span class="n">protocols</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">protocols</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">Layer2</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="n">ret</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">controller</span><span class="p">{</span>  <span class="c">// initialize the speaker's controller</span>
		<span class="c">// ...</span>
		<span class="n">protocolHandlers</span><span class="o">:</span> <span class="n">handlers</span><span class="p">,</span>
		<span class="n">announced</span><span class="o">:</span>        <span class="k">map</span><span class="p">[</span><span class="n">config</span><span class="o">.</span><span class="n">Proto</span><span class="p">]</span><span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">bool</span><span class="p">{},</span>
		<span class="n">svcIPs</span><span class="o">:</span>           <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">][]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">{},</span>
		<span class="n">protocols</span><span class="o">:</span>        <span class="n">protocols</span><span class="p">,</span>
	<span class="p">}</span>
	<span class="n">ret</span><span class="o">.</span><span class="n">announced</span><span class="p">[</span><span class="n">config</span><span class="o">.</span><span class="n">BGP</span><span class="p">]</span> <span class="o">=</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">bool</span><span class="p">{}</span>
	<span class="n">ret</span><span class="o">.</span><span class="n">announced</span><span class="p">[</span><span class="n">config</span><span class="o">.</span><span class="n">Layer2</span><span class="p">]</span> <span class="o">=</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">bool</span><span class="p">{}</span>

	<span class="k">return</span> <span class="n">ret</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<p>All of these controllers implement the <code class="language-plaintext highlighter-rouge">Protocol</code> interface, which defines the basic methods needed to announce an External IP.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">Protocol</span> <span class="k">interface</span> <span class="p">{</span>
	<span class="n">SetConfig</span><span class="p">(</span><span class="n">log</span><span class="o">.</span><span class="n">Logger</span><span class="p">,</span> <span class="o">*</span><span class="n">config</span><span class="o">.</span><span class="n">Config</span><span class="p">)</span> <span class="kt">error</span>
	<span class="n">ShouldAnnounce</span><span class="p">(</span><span class="n">log</span><span class="o">.</span><span class="n">Logger</span><span class="p">,</span> <span class="kt">string</span><span class="p">,</span> <span class="p">[]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">,</span> <span class="o">*</span><span class="n">config</span><span class="o">.</span><span class="n">Pool</span><span class="p">,</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Service</span><span class="p">,</span> <span class="n">epslices</span><span class="o">.</span><span class="n">EpsOrSlices</span><span class="p">)</span> <span class="kt">string</span>
	<span class="n">SetBalancer</span><span class="p">(</span><span class="n">log</span><span class="o">.</span><span class="n">Logger</span><span class="p">,</span> <span class="kt">string</span><span class="p">,</span> <span class="p">[]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">,</span> <span class="o">*</span><span class="n">config</span><span class="o">.</span><span class="n">Pool</span><span class="p">,</span> <span class="n">service</span><span class="p">,</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Service</span><span class="p">)</span> <span class="kt">error</span>
	<span class="n">DeleteBalancer</span><span class="p">(</span><span class="n">log</span><span class="o">.</span><span class="n">Logger</span><span class="p">,</span> <span class="kt">string</span><span class="p">,</span> <span class="kt">string</span><span class="p">)</span> <span class="kt">error</span>
	<span class="n">SetNode</span><span class="p">(</span><span class="n">log</span><span class="o">.</span><span class="n">Logger</span><span class="p">,</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Node</span><span class="p">)</span> <span class="kt">error</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In the speaker, every event related to a Service update is captured by the speaker's controller, which then invokes each working mode in turn. Inside the <code class="language-plaintext highlighter-rouge">handleService</code> method, each mode first calls <code class="language-plaintext highlighter-rouge">ShouldAnnounce</code> to check whether the current Node is eligible to announce, and then calls <code class="language-plaintext highlighter-rouge">SetBalancer</code> to announce the IP.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">protocol</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">c</span><span class="o">.</span><span class="n">protocols</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">st</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">handleService</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">lbIPs</span><span class="p">,</span> <span class="n">svc</span><span class="p">,</span> <span class="n">pool</span><span class="p">,</span> <span class="n">eps</span><span class="p">,</span> <span class="n">protocol</span><span class="p">);</span> <span class="n">st</span> <span class="o">==</span> <span class="n">controllers</span><span class="o">.</span><span class="n">SyncStateError</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">st</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
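<p>The two-step flow described above (check with <code class="language-plaintext highlighter-rouge">ShouldAnnounce</code>, then announce with <code class="language-plaintext highlighter-rouge">SetBalancer</code>) can be sketched roughly as follows. This is a minimal illustration, not MetalLB's actual code: the <code class="language-plaintext highlighter-rouge">protocol</code> interface is a trimmed-down, hypothetical stand-in for <code class="language-plaintext highlighter-rouge">Protocol</code>, and <code class="language-plaintext highlighter-rouge">fakeProtocol</code> and this simplified <code class="language-plaintext highlighter-rouge">handleService</code> are assumptions made for the example.</p>

```go
package main

import "fmt"

// protocol is a hypothetical, trimmed-down stand-in for the Protocol interface.
type protocol interface {
	ShouldAnnounce(svc string) string // "" means this Node should announce
	SetBalancer(svc string) error
}

// fakeProtocol records which Services it has announced.
type fakeProtocol struct {
	owner     bool
	announced []string
}

func (f *fakeProtocol) ShouldAnnounce(svc string) string {
	if f.owner {
		return ""
	}
	return "notOwner"
}

func (f *fakeProtocol) SetBalancer(svc string) error {
	f.announced = append(f.announced, svc)
	return nil
}

// handleService announces svc only when ShouldAnnounce returns an empty reason.
func handleService(p protocol, svc string) (bool, error) {
	if reason := p.ShouldAnnounce(svc); reason != "" {
		return false, nil // not the owner: nothing is announced
	}
	if err := p.SetBalancer(svc); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	leader := &fakeProtocol{owner: true}
	follower := &fakeProtocol{}
	for _, p := range []*fakeProtocol{leader, follower} {
		ok, _ := handleService(p, "default/web")
		fmt.Println(ok, len(p.announced))
	}
}
```

<p>Only the speaker whose <code class="language-plaintext highlighter-rouge">ShouldAnnounce</code> returns an empty string ever reaches <code class="language-plaintext highlighter-rouge">SetBalancer</code>; the others skip the Service without announcing anything.</p>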
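<h3 id="layer2-模式">Layer2 mode</h3>
<p>In L2 mode, the speaker component (a DaemonSet) on a single Node, the leader speaker, announces the Service's External IP within a subnet. In other words, that IP address appears on the Node's network interface and serves as the entry point for external traffic to the Service. All traffic for the Service External IP is routed to that one Node, and once it arrives, <a href="https://shawnh2.github.io/post/2023/05/18/kube-proxy-walk-through.html#loadbalancer">kube-proxy distributes it across the Pods backing the Service</a>. Because all traffic enters through a single Node, MetalLB strictly speaking does not implement a load balancer in L2 mode. What it implements instead is a <strong>failover</strong>, or <strong>high-availability</strong>, mechanism: when one speaker becomes unavailable, a speaker on another Node takes over announcing the Service External IP.</p>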

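<p>A cluster may contain multiple address pools, that is, multiple subnets, so the failover mechanism is applied per subnet. As shown below, Node A and Node B both belong to subnet A, so one of them is elected leader speaker for subnet A. Node C is the only Node in subnet B, so it always acts as that subnet's leader speaker.</p>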

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-06/metallb-l2-subnet.png" alt="metallb-l2-subnet" /></p>

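<p>As for the announcement protocol, the speaker announces the IP address of an IPv4 Service through ARP and that of an IPv6 Service through NDP. Note that because L2 mode relies on ARP and NDP, the network of the requesting client and the Service External IP <strong>must</strong> belong to the same subnet.</p>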

<p>In addition, once traffic enters the Node, kube-proxy forwards it according to the <code class="language-plaintext highlighter-rouge">ExternalTrafficPolicy</code> set on the Service:</p>

<ul>
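  <li>With the <code class="language-plaintext highlighter-rouge">Cluster</code> policy (the default), kube-proxy forwards traffic to all Pods backing the Service across the cluster. Because kube-proxy masquerades the source address of each request, the Pods see the IP of the leader speaker's Node as the source address of this external traffic</li>
  <li>With the <code class="language-plaintext highlighter-rouge">Local</code> policy, kube-proxy forwards traffic only to the Service Pods on the current Node. These Pods see the real external source address, but only a small subset of the Pods receives traffic, which can easily lead to an imbalance</li>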
</ul>

<h4 id="leader-选举">Leader election</h4>
<p>During the election, a leader speaker candidate has to satisfy the following prerequisites:</p>

<ul>
  <li>A leader speaker candidate <strong>must</strong> run on a Node selected for the subnet. Nodes are chosen through a NodeSelector; if no selector is specified, all Nodes are used by default</li>
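  <li>The Service <strong>must have</strong> Pods in the Ready state behind it (the code checks this with <code class="language-plaintext highlighter-rouge">activeEndpointExists</code>)</li>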
</ul>

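<p>The complete leader election flow in L2 mode is implemented by the <code class="language-plaintext highlighter-rouge">ShouldAnnounce</code> method. <strong>Whenever</strong> this method returns a non-empty string, the current speaker should not act as leader.</p>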
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// speaker/layer2_controller.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">layer2Controller</span><span class="p">)</span> <span class="n">ShouldAnnounce</span><span class="p">(</span><span class="n">l</span> <span class="n">log</span><span class="o">.</span><span class="n">Logger</span><span class="p">,</span> <span class="n">name</span> <span class="kt">string</span><span class="p">,</span> <span class="n">toAnnounce</span> <span class="p">[]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">,</span> <span class="n">pool</span> <span class="o">*</span><span class="n">config</span><span class="o">.</span><span class="n">Pool</span><span class="p">,</span> <span class="n">svc</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Service</span><span class="p">,</span> <span class="n">eps</span> <span class="n">epslices</span><span class="o">.</span><span class="n">EpsOrSlices</span><span class="p">)</span> <span class="kt">string</span> <span class="p">{</span>
	<span class="c">// Check whether the Endpoints or EndpointSlices contain a Ready endpoint</span>
	<span class="k">if</span> <span class="o">!</span><span class="n">activeEndpointExists</span><span class="p">(</span><span class="n">eps</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">return</span> <span class="s">"notOwner"</span>
	<span class="p">}</span>

	<span class="c">// Check whether the speaker's Node matches the NodeSelector of the pool's L2Advertisements</span>
	<span class="k">if</span> <span class="o">!</span><span class="n">poolMatchesNodeL2</span><span class="p">(</span><span class="n">pool</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">myNode</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">return</span> <span class="s">"notOwner"</span>
	<span class="p">}</span>

	<span class="c">// Collect every speaker Node that matches the NodeSelector of the pool's L2Advertisements</span>
	<span class="n">forPool</span> <span class="o">:=</span> <span class="n">speakersForPool</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">sList</span><span class="o">.</span><span class="n">UsableSpeakers</span><span class="p">(),</span> <span class="n">pool</span><span class="p">)</span>  <span class="c">// chosen only from the usable speakers</span>
	<span class="k">var</span> <span class="n">nodes</span> <span class="p">[]</span><span class="kt">string</span>
	<span class="c">// Pick the candidate Nodes according to the external traffic policy</span>
	<span class="k">if</span> <span class="n">svc</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">ExternalTrafficPolicy</span> <span class="o">==</span> <span class="n">v1</span><span class="o">.</span><span class="n">ServiceExternalTrafficPolicyTypeLocal</span> <span class="p">{</span>
		<span class="c">// For the Local policy, only Nodes that host Endpoints can be candidates</span>
		<span class="n">nodes</span> <span class="o">=</span> <span class="n">usableNodes</span><span class="p">(</span><span class="n">eps</span><span class="p">,</span> <span class="n">forPool</span><span class="p">)</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="c">// For the Cluster policy, all of the Nodes above can be candidates</span>
		<span class="n">nodes</span> <span class="o">=</span> <span class="n">nodesWithActiveSpeakers</span><span class="p">(</span><span class="n">forPool</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="n">ipString</span> <span class="o">:=</span> <span class="n">toAnnounce</span><span class="p">[</span><span class="m">0</span><span class="p">]</span><span class="o">.</span><span class="n">String</span><span class="p">()</span>
	<span class="c">// Sort nodes by the hash of node name + LB IP</span>
	<span class="n">sort</span><span class="o">.</span><span class="n">Slice</span><span class="p">(</span><span class="n">nodes</span><span class="p">,</span> <span class="k">func</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="kt">int</span><span class="p">)</span> <span class="kt">bool</span> <span class="p">{</span>
		<span class="n">hi</span> <span class="o">:=</span> <span class="n">sha256</span><span class="o">.</span><span class="n">Sum256</span><span class="p">([]</span><span class="kt">byte</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="s">"#"</span> <span class="o">+</span> <span class="n">ipString</span><span class="p">))</span>
		<span class="n">hj</span> <span class="o">:=</span> <span class="n">sha256</span><span class="o">.</span><span class="n">Sum256</span><span class="p">([]</span><span class="kt">byte</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">+</span> <span class="s">"#"</span> <span class="o">+</span> <span class="n">ipString</span><span class="p">))</span>
		<span class="k">return</span> <span class="n">bytes</span><span class="o">.</span><span class="n">Compare</span><span class="p">(</span><span class="n">hi</span><span class="p">[</span><span class="o">:</span><span class="p">],</span> <span class="n">hj</span><span class="p">[</span><span class="o">:</span><span class="p">])</span> <span class="o">&lt;</span> <span class="m">0</span>
	<span class="p">})</span>

	<span class="c">// If this speaker's Node comes first in the sorted list, this speaker takes on the announcement</span>
	<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">nodes</span><span class="p">)</span> <span class="o">&gt;</span> <span class="m">0</span> <span class="o">&amp;&amp;</span> <span class="n">nodes</span><span class="p">[</span><span class="m">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">c</span><span class="o">.</span><span class="n">myNode</span> <span class="p">{</span>
		<span class="k">return</span> <span class="s">""</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="s">"notOwner"</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Which speakers become leader candidates also depends on <code class="language-plaintext highlighter-rouge">ExternalTrafficPolicy</code>, as shown below. With the Local policy, only Nodes that host Service Pods are selected, because if the leader speaker sat on a Node without any Service Pod, no Pod would respond to the external traffic entering that Node.</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-06/metallb-announce.png" alt="metallb-announce" /></p>

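<p>When electing the leader speaker, the candidate Nodes are also sorted. The sort considers only two factors, the Node name and the LB IP. This also works for shared IP addresses, because different Services with the same IP produce the same ordering. Since every speaker runs <code class="language-plaintext highlighter-rouge">ShouldAnnounce</code>, and only the speaker whose Node equals the first Node in the sorted list is chosen, exactly one leader speaker ends up elected, as long as all speakers see the same set of candidates.</p>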
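<p>To make the determinism of this ordering concrete, here is a minimal, self-contained sketch. It reuses the hashing scheme from <code class="language-plaintext highlighter-rouge">ShouldAnnounce</code> (SHA-256 over node name + <code class="language-plaintext highlighter-rouge">"#"</code> + LB IP), but <code class="language-plaintext highlighter-rouge">electLeader</code>, <code class="language-plaintext highlighter-rouge">without</code>, the node names and the IP are all hypothetical names invented for the example, not part of MetalLB.</p>

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

// electLeader sorts the candidates by sha256(node + "#" + lbIP) and returns
// the first one. It returns "" when there are no candidates.
func electLeader(candidates []string, lbIP string) string {
	if len(candidates) == 0 {
		return ""
	}
	nodes := append([]string(nil), candidates...) // copy, keep the input untouched
	sort.Slice(nodes, func(i, j int) bool {
		hi := sha256.Sum256([]byte(nodes[i] + "#" + lbIP))
		hj := sha256.Sum256([]byte(nodes[j] + "#" + lbIP))
		return bytes.Compare(hi[:], hj[:]) < 0
	})
	return nodes[0]
}

// without returns a copy of nodes with name removed (hypothetical helper).
func without(nodes []string, name string) []string {
	var out []string
	for _, n := range nodes {
		if n != name {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	ip := "192.168.1.240"
	seenByA := []string{"node-a", "node-b", "node-c"}
	seenByB := []string{"node-c", "node-a", "node-b"} // same set, different order

	leader := electLeader(seenByA, ip)
	fmt.Println(leader == electLeader(seenByB, ip)) // prints "true"

	// Failover: once the leader drops out, the remaining speakers agree on a new one.
	next := electLeader(without(seenByA, leader), ip)
	fmt.Println(next != "" && next != leader) // prints "true"
}
```

<p>Because the result depends only on the candidate set and the LB IP, each speaker can decide locally whether it is the leader without exchanging any extra election messages.</p>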
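<h4 id="announcer-与接口">Announcer and interfaces</h4>
<p>Before the L2 controller is initialized, an Announcer is initialized as well. This structure is dedicated to announcing new IPs that map to the current node's MAC address. It also starts two periodic goroutines: <code class="language-plaintext highlighter-rouge">interfaceScan</code> scans the usable interfaces on the Node (at a fixed interval of 10s), and <code class="language-plaintext highlighter-rouge">spamLoop</code> periodically sends unsolicited ARP/NDP responses (it also listens on the <code class="language-plaintext highlighter-rouge">spamCh</code> channel).</p>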

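<p>The main rules for deciding whether an interface is usable are shown below. They check <strong>whether the interface is up, whether a <code class="language-plaintext highlighter-rouge">master</code> entry exists for it under <code class="language-plaintext highlighter-rouge">/sys/class/net/</code> (interfaces that have one are skipped), whether the interface supports broadcast, and whether ARP is enabled on it to resolve the MAC address of a destination IP</strong>. For every usable interface, the speaker creates a matching ARP/NDP Responder instance based on its address type, which handles the requests and responses of that protocol on the interface.</p>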
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/layer2/announcer.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">Announce</span><span class="p">)</span> <span class="n">updateInterfaces</span><span class="p">()</span> <span class="p">{</span>
	<span class="n">ifs</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">net</span><span class="o">.</span><span class="n">Interfaces</span><span class="p">()</span>
	<span class="c">// ...</span>

	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">intf</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">ifs</span> <span class="p">{</span>
		<span class="n">ifi</span> <span class="o">:=</span> <span class="n">intf</span>

		<span class="k">if</span> <span class="n">ifi</span><span class="o">.</span><span class="n">Flags</span><span class="o">&amp;</span><span class="n">net</span><span class="o">.</span><span class="n">FlagUp</span> <span class="o">==</span> <span class="m">0</span> <span class="p">{</span>  <span class="c">// skip interfaces that are not up</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">Stat</span><span class="p">(</span><span class="s">"/sys/class/net/"</span> <span class="o">+</span> <span class="n">ifi</span><span class="o">.</span><span class="n">Name</span> <span class="o">+</span> <span class="s">"/master"</span><span class="p">);</span> <span class="o">!</span><span class="n">os</span><span class="o">.</span><span class="n">IsNotExist</span><span class="p">(</span><span class="n">err</span><span class="p">)</span> <span class="p">{</span>  <span class="c">// skip interfaces that have a master entry</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="n">f</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">ReadFile</span><span class="p">(</span><span class="s">"/sys/class/net/"</span> <span class="o">+</span> <span class="n">ifi</span><span class="o">.</span><span class="n">Name</span> <span class="o">+</span> <span class="s">"/flags"</span><span class="p">)</span>  <span class="c">// does the interface support ARP?</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">flags</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">strconv</span><span class="o">.</span><span class="n">ParseUint</span><span class="p">(</span><span class="kt">string</span><span class="p">(</span><span class="n">f</span><span class="p">)[</span><span class="o">:</span><span class="nb">len</span><span class="p">(</span><span class="kt">string</span><span class="p">(</span><span class="n">f</span><span class="p">))</span><span class="o">-</span><span class="m">1</span><span class="p">],</span> <span class="m">0</span><span class="p">,</span> <span class="m">32</span><span class="p">)</span>
			<span class="c">// NOARP flag</span>
			<span class="k">if</span> <span class="n">flags</span><span class="o">&amp;</span><span class="m">0x80</span> <span class="o">!=</span> <span class="m">0</span> <span class="p">{</span>
				<span class="k">continue</span>
			<span class="p">}</span>
		<span class="p">}</span>
		<span class="k">if</span> <span class="n">ifi</span><span class="o">.</span><span class="n">Flags</span><span class="o">&amp;</span><span class="n">net</span><span class="o">.</span><span class="n">FlagBroadcast</span> <span class="o">!=</span> <span class="m">0</span> <span class="p">{</span>  <span class="c">// does the interface support broadcast?</span>
			<span class="n">keepARP</span><span class="p">[</span><span class="n">ifi</span><span class="o">.</span><span class="n">Index</span><span class="p">]</span> <span class="o">=</span> <span class="no">true</span>
		<span class="p">}</span>

		<span class="c">// ...</span>

		<span class="c">// Initialize and store the Responder for each interface</span>
		<span class="k">if</span> <span class="n">keepARP</span><span class="p">[</span><span class="n">ifi</span><span class="o">.</span><span class="n">Index</span><span class="p">]</span> <span class="o">&amp;&amp;</span> <span class="n">a</span><span class="o">.</span><span class="n">arps</span><span class="p">[</span><span class="n">ifi</span><span class="o">.</span><span class="n">Index</span><span class="p">]</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">resp</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">newARPResponder</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">logger</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ifi</span><span class="p">,</span> <span class="n">a</span><span class="o">.</span><span class="n">shouldAnnounce</span><span class="p">)</span>
			<span class="n">a</span><span class="o">.</span><span class="n">arps</span><span class="p">[</span><span class="n">ifi</span><span class="o">.</span><span class="n">Index</span><span class="p">]</span> <span class="o">=</span> <span class="n">resp</span>
		<span class="p">}</span>
		<span class="k">if</span> <span class="n">keepNDP</span><span class="p">[</span><span class="n">ifi</span><span class="o">.</span><span class="n">Index</span><span class="p">]</span> <span class="o">&amp;&amp;</span> <span class="n">a</span><span class="o">.</span><span class="n">ndps</span><span class="p">[</span><span class="n">ifi</span><span class="o">.</span><span class="n">Index</span><span class="p">]</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">resp</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">newNDPResponder</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">logger</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ifi</span><span class="p">,</span> <span class="n">a</span><span class="o">.</span><span class="n">shouldAnnounce</span><span class="p">)</span>
			<span class="n">a</span><span class="o">.</span><span class="n">ndps</span><span class="p">[</span><span class="n">ifi</span><span class="o">.</span><span class="n">Index</span><span class="p">]</span> <span class="o">=</span> <span class="n">resp</span>
		<span class="p">}</span>
	<span class="p">}</span>

    <span class="c">// ...</span>
<span class="p">}</span>
</code></pre></div></div>
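<p>The NOARP check in the snippet above reads the interface flags from sysfs and tests the <code class="language-plaintext highlighter-rouge">0x80</code> bit (<code class="language-plaintext highlighter-rouge">IFF_NOARP</code> on Linux). The following sketch isolates just that step so it can be run without touching <code class="language-plaintext highlighter-rouge">/sys</code>. The helper name <code class="language-plaintext highlighter-rouge">noARP</code> and the sample flag values are assumptions made for the example. Unlike the original, it trims whitespace with <code class="language-plaintext highlighter-rouge">strings.TrimSpace</code> and returns the parse error instead of ignoring it.</p>

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// iffNoARP is the Linux IFF_NOARP interface flag.
const iffNoARP = 0x80

// noARP parses the content of /sys/class/net/<name>/flags (for example
// "0x1003\n") and reports whether the NOARP bit is set.
func noARP(raw string) (bool, error) {
	// Base 0 lets ParseUint honour the "0x" prefix used by sysfs.
	flags, err := strconv.ParseUint(strings.TrimSpace(raw), 0, 32)
	if err != nil {
		return false, err
	}
	return flags&iffNoARP != 0, nil
}

func main() {
	for _, raw := range []string{"0x1003\n", "0x1083\n"} {
		skip, err := noARP(raw)
		fmt.Println(strings.TrimSpace(raw), skip, err)
	}
}
```

<p>An interface whose flags have this bit set cannot answer ARP, so creating an ARP Responder for it would be pointless.</p>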
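<p>When announcing, the L2 controller compares <strong>all interfaces on the Node</strong> collected by the Announcer with the interfaces specified in the <code class="language-plaintext highlighter-rouge">L2Advertisement</code> CR. As long as at least one specified interface is among them, the specified interfaces are used for the announcement. In the end, an <code class="language-plaintext highlighter-rouge">IPAdvertisement</code> structure is produced for each LB IP of the Service, recording the set of interfaces associated with that IP.</p>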
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// speaker/layer2_controller.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">layer2Controller</span><span class="p">)</span> <span class="n">SetBalancer</span><span class="p">(</span><span class="n">l</span> <span class="n">log</span><span class="o">.</span><span class="n">Logger</span><span class="p">,</span> <span class="n">name</span> <span class="kt">string</span><span class="p">,</span> <span class="n">lbIPs</span> <span class="p">[]</span><span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">,</span> <span class="n">pool</span> <span class="o">*</span><span class="n">config</span><span class="o">.</span><span class="n">Pool</span><span class="p">,</span> <span class="n">client</span> <span class="n">service</span><span class="p">,</span> <span class="n">svc</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Service</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="c">// Get the interfaces collected by the Announcer</span>
	<span class="n">ifs</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">announcer</span><span class="o">.</span><span class="n">GetInterfaces</span><span class="p">()</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">lbIP</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">lbIPs</span> <span class="p">{</span>
		<span class="c">// Build the IPAdvertisement for this LB IP; it records the interfaces designated for use</span>
		<span class="n">ipAdv</span> <span class="o">:=</span> <span class="n">ipAdvertisementFor</span><span class="p">(</span><span class="n">lbIP</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">myNode</span><span class="p">,</span> <span class="n">pool</span><span class="o">.</span><span class="n">L2Advertisements</span><span class="p">)</span>
		<span class="c">// Check whether the two sets of interfaces match</span>
		<span class="k">if</span> <span class="o">!</span><span class="n">ipAdv</span><span class="o">.</span><span class="n">MatchInterfaces</span><span class="p">(</span><span class="n">ifs</span><span class="o">...</span><span class="p">)</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="n">c</span><span class="o">.</span><span class="n">announcer</span><span class="o">.</span><span class="n">SetBalancer</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">ipAdv</span><span class="p">)</span>  <span class="c">// announce the IP</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">ipAdvertisementFor</span><span class="p">(</span><span class="n">ip</span> <span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">,</span> <span class="n">localNode</span> <span class="kt">string</span><span class="p">,</span> <span class="n">l2Advertisements</span> <span class="p">[]</span><span class="o">*</span><span class="n">config</span><span class="o">.</span><span class="n">L2Advertisement</span><span class="p">)</span> <span class="n">layer2</span><span class="o">.</span><span class="n">IPAdvertisement</span> <span class="p">{</span>
	<span class="n">ifs</span> <span class="o">:=</span> <span class="n">sets</span><span class="o">.</span><span class="n">Set</span><span class="p">[</span><span class="kt">string</span><span class="p">]{}</span>  <span class="c">// interfaces designated for use</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">l2</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">l2Advertisements</span> <span class="p">{</span>
		<span class="c">// Skip L2Advertisements that do not apply to this Node</span>
		<span class="k">if</span> <span class="n">matchNode</span> <span class="o">:=</span> <span class="n">l2</span><span class="o">.</span><span class="n">Nodes</span><span class="p">[</span><span class="n">localNode</span><span class="p">];</span> <span class="o">!</span><span class="n">matchNode</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="c">// To use all interfaces, simply leave the interfaces unconfigured</span>
		<span class="k">if</span> <span class="n">l2</span><span class="o">.</span><span class="n">AllInterfaces</span> <span class="p">{</span>
			<span class="k">return</span> <span class="n">layer2</span><span class="o">.</span><span class="n">NewIPAdvertisement</span><span class="p">(</span><span class="n">ip</span><span class="p">,</span> <span class="no">true</span><span class="p">,</span> <span class="n">sets</span><span class="o">.</span><span class="n">Set</span><span class="p">[</span><span class="kt">string</span><span class="p">]{})</span>
		<span class="p">}</span>
		<span class="n">ifs</span> <span class="o">=</span> <span class="n">ifs</span><span class="o">.</span><span class="n">Insert</span><span class="p">(</span><span class="n">l2</span><span class="o">.</span><span class="n">Interfaces</span><span class="o">...</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">layer2</span><span class="o">.</span><span class="n">NewIPAdvertisement</span><span class="p">(</span><span class="n">ip</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="n">ifs</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
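<p>The matching rule described above ("at least one specified interface must be present on the Node, unless all interfaces are allowed") can be sketched as follows. This is only an illustration of that rule under stated assumptions: the <code class="language-plaintext highlighter-rouge">advertisement</code> type and its <code class="language-plaintext highlighter-rouge">matchInterfaces</code> method are hypothetical and are not the actual <code class="language-plaintext highlighter-rouge">IPAdvertisement</code> implementation.</p>

```go
package main

import "fmt"

// advertisement is a hypothetical, simplified stand-in for IPAdvertisement.
type advertisement struct {
	allInterfaces bool
	interfaces    map[string]struct{} // interfaces designated for this IP
}

// matchInterfaces reports whether the IP may be announced on this Node:
// either every interface is allowed, or at least one designated interface
// is among the interfaces available on the Node.
func (a advertisement) matchInterfaces(available ...string) bool {
	if a.allInterfaces {
		return true
	}
	for _, name := range available {
		if _, ok := a.interfaces[name]; ok {
			return true
		}
	}
	return false
}

func main() {
	onNode := []string{"eth0", "eth1", "veth1234"}

	pinned := advertisement{interfaces: map[string]struct{}{"eth1": {}}}
	missing := advertisement{interfaces: map[string]struct{}{"bond0": {}}}
	any := advertisement{allInterfaces: true}

	fmt.Println(pinned.matchInterfaces(onNode...))  // prints "true"
	fmt.Println(missing.matchInterfaces(onNode...)) // prints "false"
	fmt.Println(any.matchInterfaces(onNode...))     // prints "true"
}
```

<p>In the <code class="language-plaintext highlighter-rouge">missing</code> case the LB IP is simply skipped on this Node, which corresponds to the <code class="language-plaintext highlighter-rouge">continue</code> branch in <code class="language-plaintext highlighter-rouge">SetBalancer</code> above.</p>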
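<p>The ability to designate interfaces for announcement, mentioned above, was proposed in MetalLB <a href="https://github.com/metallb/metallb/issues/277">#277</a> and introduced by <a href="https://github.com/metallb/metallb/pull/1536">#1536</a>. It lets an LB IP be announced through only a subset of designated network interfaces rather than through every usable interface.</p>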

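<p>The motivation for this mechanism is discussed at length in the Issue. In my view, the most important point is this: listening on every interface of a Node in a K8s cluster produces a lot of meaningless logs, and those interfaces include the veth pairs that the CNI creates for each Pod. From the implementation side, however, listening on all interfaces is the simplest approach, because MetalLB cannot know which interfaces are useful now or will be later; that information is prior knowledge held by the user. The mechanism was eventually exposed as an optional configuration item through the ConfigMap. The <a href="https://github.com/metallb/metallb/blob/main/design/layer2-bind-interfaces.md#motivation">proposal description</a> also mentions a problem caused by listening on multiple interfaces of complex types, as shown below:</p>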

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-06/metallb-int-ann.png" alt="metallb-int-ann" /></p>

<blockquote>
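  <p>Roughly, the problem is this: for all complex interface types (such as bridge, ovs and macvlan), MetalLB receives every ARP request from them and responds with the MAC addresses of all slave interfaces attached to them.</p>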

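  <p>Suppose two virtual interfaces, veth0 and veth1, belong to different subnets but are both slave interfaces of eth0. MetalLB works in the <code class="language-plaintext highlighter-rouge">192.172.1.0/24</code> subnet and has assigned an IP from that subnet (say <code class="language-plaintext highlighter-rouge">192.172.1.10</code>) to a LoadBalancer Service. When a client tries to reach the Service through that IP, the request may be received by veth0 rather than veth1, because the speaker announced this VIP from all interfaces.</p>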
</blockquote>

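<p>This article does not analyze the problem any further, because it appears only as the <a href="https://github.com/metallb/metallb/pull/1359#issuecomment-1121136050">motivation for the proposal</a>, and I did not find a comparable real-world incident in the Issues, so there is little to expand on.</p>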
<h4 id="responder">Responder</h4>
<p>The announcement of every interface Responder is triggered through the Announcer's <code class="language-plaintext highlighter-rouge">SetBalancer</code> method, which ends by having <code class="language-plaintext highlighter-rouge">spamLoop</code> perform one round of ARP/NDP flooding.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/layer2/announcer.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">Announce</span><span class="p">)</span> <span class="n">SetBalancer</span><span class="p">(</span><span class="n">name</span> <span class="kt">string</span><span class="p">,</span> <span class="n">adv</span> <span class="n">IPAdvertisement</span><span class="p">)</span> <span class="p">{</span>  <span class="c">// name is the Service name</span>
	<span class="c">// Write to spamCh, which triggers spamLoop to send ARP responses</span>
	<span class="k">defer</span> <span class="n">a</span><span class="o">.</span><span class="n">doSpam</span><span class="p">(</span><span class="n">adv</span><span class="p">)</span>  <span class="c">// doSpam performs: a.spamCh &lt;- adv</span>

	<span class="n">a</span><span class="o">.</span><span class="n">Lock</span><span class="p">()</span>
	<span class="k">defer</span> <span class="n">a</span><span class="o">.</span><span class="n">Unlock</span><span class="p">()</span>

	<span class="c">// A Service's IPAdvertisement may be set many times, but it is appended and counted only the first time</span>
	<span class="k">if</span> <span class="n">ipAdvertisements</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">ips</span><span class="p">[</span><span class="n">name</span><span class="p">];</span> <span class="n">ok</span> <span class="p">{</span>
		<span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">ipAdvertisements</span> <span class="p">{</span>
			<span class="k">if</span> <span class="n">adv</span><span class="o">.</span><span class="n">ip</span><span class="o">.</span><span class="n">Equal</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">ips</span><span class="p">[</span><span class="n">name</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">ip</span><span class="p">)</span> <span class="p">{</span>
				<span class="n">a</span><span class="o">.</span><span class="n">ips</span><span class="p">[</span><span class="n">name</span><span class="p">][</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">adv</span> <span class="c">// already known: overwrite the old value in case the interfaces changed</span>
				<span class="k">return</span>
			<span class="p">}</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="n">a</span><span class="o">.</span><span class="n">ips</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">ips</span><span class="p">[</span><span class="n">name</span><span class="p">],</span> <span class="n">adv</span><span class="p">)</span>

	<span class="c">// Record the reference count of this IP</span>
	<span class="n">a</span><span class="o">.</span><span class="n">ipRefcnt</span><span class="p">[</span><span class="n">adv</span><span class="o">.</span><span class="n">ip</span><span class="o">.</span><span class="n">String</span><span class="p">()]</span><span class="o">++</span>

	<span class="c">// ... the deferred doSpam runs on return</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Under the hood, the flood calls the <code class="language-plaintext highlighter-rouge">gratuitous</code> method, which invokes the <code class="language-plaintext highlighter-rouge">Gratuitous</code> method of the Responder on every permitted interface to carry out the ARP/NDP flood.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">Announce</span><span class="p">)</span> <span class="n">gratuitous</span><span class="p">(</span><span class="n">adv</span> <span class="n">IPAdvertisement</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">a</span><span class="o">.</span><span class="n">RLock</span><span class="p">()</span>
	<span class="k">defer</span> <span class="n">a</span><span class="o">.</span><span class="n">RUnlock</span><span class="p">()</span>

	<span class="n">ip</span> <span class="o">:=</span> <span class="n">adv</span><span class="o">.</span><span class="n">ip</span>
	<span class="c">// A reference count of 0 for ip means this Node is not the one announcing it</span>
	<span class="k">if</span> <span class="n">a</span><span class="o">.</span><span class="n">ipRefcnt</span><span class="p">[</span><span class="n">ip</span><span class="o">.</span><span class="n">String</span><span class="p">()]</span> <span class="o">&lt;=</span> <span class="m">0</span> <span class="p">{</span>
		<span class="k">return</span>
	<span class="p">}</span>

	<span class="k">if</span> <span class="n">ip</span><span class="o">.</span><span class="n">To4</span><span class="p">()</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">client</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">a</span><span class="o">.</span><span class="n">arps</span> <span class="p">{</span>
			<span class="c">// Only use responders whose interface matches the permitted interfaces</span>
			<span class="k">if</span> <span class="o">!</span><span class="n">adv</span><span class="o">.</span><span class="n">matchInterface</span><span class="p">(</span><span class="n">client</span><span class="o">.</span><span class="n">intf</span><span class="p">)</span> <span class="p">{</span>
				<span class="k">continue</span>
			<span class="p">}</span>
			<span class="n">client</span><span class="o">.</span><span class="n">Gratuitous</span><span class="p">(</span><span class="n">ip</span><span class="p">)</span>
		<span class="p">}</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="c">// IPv6 is handled the same way</span>
		<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">client</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">a</span><span class="o">.</span><span class="n">ndps</span> <span class="p">{</span>
			<span class="k">if</span> <span class="o">!</span><span class="n">adv</span><span class="o">.</span><span class="n">matchInterface</span><span class="p">(</span><span class="n">client</span><span class="o">.</span><span class="n">intf</span><span class="p">)</span> <span class="p">{</span>
				<span class="k">continue</span>
			<span class="p">}</span>
			<span class="n">client</span><span class="o">.</span><span class="n">Gratuitous</span><span class="p">(</span><span class="n">ip</span><span class="p">)</span>
		<span class="p">}</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h5 id="garp-协议">G/ARP Protocol</h5>
<p>The ARP-mode Responder (ARPResp) opens a connection on its interface at initialization and starts a goroutine that reads packets from it. Not every packet it reads is answered, of course:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/layer2/arp.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">arpResponder</span><span class="p">)</span> <span class="n">processRequest</span><span class="p">()</span> <span class="n">dropReason</span> <span class="p">{</span>
	<span class="n">pkt</span><span class="p">,</span> <span class="n">eth</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">conn</span><span class="o">.</span><span class="n">Read</span><span class="p">()</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">dropReasonError</span>
	<span class="p">}</span>

	<span class="c">// Ignore ARP replies</span>
	<span class="k">if</span> <span class="n">pkt</span><span class="o">.</span><span class="n">Operation</span> <span class="o">!=</span> <span class="n">arp</span><span class="o">.</span><span class="n">OperationRequest</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">dropReasonARPReply</span>
	<span class="p">}</span>

	<span class="c">// Ignore ARP requests that are neither broadcast nor addressed to this interface's MAC</span>
	<span class="k">if</span> <span class="o">!</span><span class="n">bytes</span><span class="o">.</span><span class="n">Equal</span><span class="p">(</span><span class="n">eth</span><span class="o">.</span><span class="n">Destination</span><span class="p">,</span> <span class="n">ethernet</span><span class="o">.</span><span class="n">Broadcast</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">bytes</span><span class="o">.</span><span class="n">Equal</span><span class="p">(</span><span class="n">eth</span><span class="o">.</span><span class="n">Destination</span><span class="p">,</span> <span class="n">a</span><span class="o">.</span><span class="n">hardwareAddr</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">dropReasonEthernetDestination</span>
	<span class="p">}</span>

	<span class="c">// Ignore ARP requests that the Announcer says to ignore</span>
	<span class="n">reason</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">announce</span><span class="p">(</span><span class="n">pkt</span><span class="o">.</span><span class="n">TargetIP</span><span class="p">,</span> <span class="n">a</span><span class="o">.</span><span class="n">intf</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">reason</span> <span class="o">!=</span> <span class="n">dropReasonNone</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">reason</span>
	<span class="p">}</span>

	<span class="n">a</span><span class="o">.</span><span class="n">conn</span><span class="o">.</span><span class="n">Reply</span><span class="p">(</span><span class="n">pkt</span><span class="p">,</span> <span class="n">a</span><span class="o">.</span><span class="n">hardwareAddr</span><span class="p">,</span> <span class="n">pkt</span><span class="o">.</span><span class="n">TargetIP</span><span class="p">)</span>  <span class="c">// reply to the ARP request</span>
	<span class="k">return</span> <span class="n">dropReasonNone</span>
<span class="p">}</span>
</code></pre></div></div>
<p>While filtering ARP requests, ARPResp also applies the Announcer's own filter rules by calling <code class="language-plaintext highlighter-rouge">announce</code>. <code class="language-plaintext highlighter-rouge">announce</code> is a function-valued field of the ARPResp struct; when the Announcer initializes an ARPResp, it passes in its own <code class="language-plaintext highlighter-rouge">shouldAnnounce</code> method. That method drops packets whose target IP is not in <code class="language-plaintext highlighter-rouge">IPAdvertisements</code>, and also ignores packets received on an interface that is not a valid (responding) interface for that IP.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/layer2/announcer.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">Announce</span><span class="p">)</span> <span class="n">shouldAnnounce</span><span class="p">(</span><span class="n">ip</span> <span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">,</span> <span class="n">intf</span> <span class="kt">string</span><span class="p">)</span> <span class="n">dropReason</span> <span class="p">{</span>
	<span class="n">a</span><span class="o">.</span><span class="n">RLock</span><span class="p">()</span>
	<span class="k">defer</span> <span class="n">a</span><span class="o">.</span><span class="n">RUnlock</span><span class="p">()</span>
	<span class="n">ipFound</span> <span class="o">:=</span> <span class="no">false</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">ipAdvertisements</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">a</span><span class="o">.</span><span class="n">ips</span> <span class="p">{</span>
		<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">i</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">ipAdvertisements</span> <span class="p">{</span>
			<span class="k">if</span> <span class="n">i</span><span class="o">.</span><span class="n">ip</span><span class="o">.</span><span class="n">Equal</span><span class="p">(</span><span class="n">ip</span><span class="p">)</span> <span class="p">{</span>
				<span class="n">ipFound</span> <span class="o">=</span> <span class="no">true</span>
				<span class="k">if</span> <span class="n">i</span><span class="o">.</span><span class="n">matchInterface</span><span class="p">(</span><span class="n">intf</span><span class="p">)</span> <span class="p">{</span>  <span class="c">// advertised IP on a permitted interface: respond</span>
					<span class="k">return</span> <span class="n">dropReasonNone</span>
				<span class="p">}</span>
			<span class="p">}</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="k">if</span> <span class="n">ipFound</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">dropReasonNotMatchInterface</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">dropReasonAnnounceIP</span>
<span class="p">}</span>
</code></pre></div></div>
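<p>The process above is ARPResp answering an external broadcast ARP request, which is how conventional ARP works. For MetalLB, however, any Service update may change the External IP. If clients or switches do not learn the new IP-to-MAC mapping in time (for example, because an ARP cache is stale), requests fail and traffic is lost.</p>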

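<p>To handle this, MetalLB uses another mode of ARP: Gratuitous ARP (GARP). A GARP is an ARP reply that is not sent in response to any ARP request. It is essentially a broadcast reply, and a typical use is <strong>announcing a host's presence on the network</strong>. In a GARP packet, the Opcode is set to 2, marking the packet as a reply. The sender MAC and IP addresses are those of the announcing party: in MetalLB, the Service LB IP being announced and the MAC address of the Node the speaker runs on (more precisely, the MAC of the interface responsible for announcing the IP). The target MAC address is set to <code class="language-plaintext highlighter-rouge">ffff.ffff.ffff</code> (or <code class="language-plaintext highlighter-rouge">0000.0000.0000</code>, depending on the ARP implementation), indicating a broadcast. The target IP address is again the sender's IP address, restating which IP the ARP mapping is for.</p>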

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-06/garp-packet.png" alt="garp-packet" /></p>
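<p>To make the field layout concrete, here is a minimal sketch that assembles the 28-byte ARP payload of a GARP by hand, using only the Go standard library. It is not MetalLB's code: <code class="language-plaintext highlighter-rouge">buildGARP</code> and the example MAC and IP are hypothetical, and a real sender would still have to wrap the payload in an Ethernet frame and write it to a raw socket.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// broadcast is the all-ones Ethernet address used as the target MAC.
var broadcast = net.HardwareAddr{0xff, 0xff, 0xff, 0xff, 0xff, 0xff}

// buildGARP returns the 28-byte ARP payload of a gratuitous ARP for ip,
// announced from mac. op is 1 (request) or 2 (reply).
// It returns nil when ip is not an IPv4 address or mac is not 6 bytes long.
func buildGARP(op uint16, mac net.HardwareAddr, ip net.IP) []byte {
	ip4 := ip.To4()
	if ip4 == nil || len(mac) != 6 {
		return nil
	}
	pkt := make([]byte, 28)
	binary.BigEndian.PutUint16(pkt[0:2], 1)      // hardware type: Ethernet
	binary.BigEndian.PutUint16(pkt[2:4], 0x0800) // protocol type: IPv4
	pkt[4] = 6                                   // hardware address length
	pkt[5] = 4                                   // protocol address length
	binary.BigEndian.PutUint16(pkt[6:8], op)     // opcode
	copy(pkt[8:14], mac)                         // sender MAC
	copy(pkt[14:18], ip4)                        // sender IP
	copy(pkt[18:24], broadcast)                  // target MAC
	copy(pkt[24:28], ip4)                        // target IP, same as sender IP
	return pkt
}

func main() {
	mac := net.HardwareAddr{0x02, 0x00, 0x00, 0x00, 0x00, 0x01}
	ip := net.ParseIP("192.0.2.10")
	for _, op := range []uint16{1, 2} {
		fmt.Printf("op=%d % x\n", op, buildGARP(op, mac, ip))
	}
}
</code></pre></div></div>
<p>Calling it with opcode 1 and then 2 mirrors the request-then-reply pair that ARPResp sends; the two payloads differ only in the opcode field.</p>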

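<p>Looking back at how ARPResp implements GARP when flooding, it sends not only a broadcast reply but, before it, a broadcast request with otherwise identical contents. Why include a broadcast request as well?</p>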

<ol>
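  <li>Historical reasons. Some early implementations broadcast GARP as a request, so a reply-only GARP has no effect on some older systems.</li>
  <li>The request form has a further benefit: if a GARP request is answered, another device on the segment is using the same IP, which proves an address conflict. MetalLB does not handle this case, because address conflicts essentially do not arise in MetalLB.</li>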
</ol>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/layer2/arp.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">arpResponder</span><span class="p">)</span> <span class="n">Gratuitous</span><span class="p">(</span><span class="n">ip</span> <span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">op</span> <span class="o">:=</span> <span class="k">range</span> <span class="p">[]</span><span class="n">arp</span><span class="o">.</span><span class="n">Operation</span><span class="p">{</span><span class="n">arp</span><span class="o">.</span><span class="n">OperationRequest</span><span class="p">,</span> <span class="n">arp</span><span class="o">.</span><span class="n">OperationReply</span><span class="p">}</span> <span class="p">{</span>
		<span class="n">pkt</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">arp</span><span class="o">.</span><span class="n">NewPacket</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">a</span><span class="o">.</span><span class="n">hardwareAddr</span><span class="p">,</span> <span class="n">ip</span><span class="p">,</span> <span class="n">ethernet</span><span class="o">.</span><span class="n">Broadcast</span><span class="p">,</span> <span class="n">ip</span><span class="p">)</span>
		<span class="n">a</span><span class="o">.</span><span class="n">conn</span><span class="o">.</span><span class="n">WriteTo</span><span class="p">(</span><span class="n">pkt</span><span class="p">,</span> <span class="n">ethernet</span><span class="o">.</span><span class="n">Broadcast</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<h5 id="ndp-协议">NDP Protocol</h5>
<p>IPv6 has no ARP, so NDP (Neighbor Discovery Protocol) maps IP addresses to MAC addresses. NDP defines 5 message types, all carried in ICMPv6.</p>

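<p>Flooding in the NDP-mode Responder (NDPResp) is very simple: it just sends a Neighbor Advertisement (NA) message (ICMPv6 type 136). Note, however, that this NA is sent to the special IPv6 multicast address <code class="language-plaintext highlighter-rouge">ff02::1</code>, which covers the link-local scope, so any Node that is to receive it must be a member of that multicast group.</p>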
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/layer2/ndp.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">n</span> <span class="o">*</span><span class="n">ndpResponder</span><span class="p">)</span> <span class="n">Gratuitous</span><span class="p">(</span><span class="n">ip</span> <span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="n">err</span> <span class="o">:=</span> <span class="n">n</span><span class="o">.</span><span class="n">advertise</span><span class="p">(</span><span class="n">net</span><span class="o">.</span><span class="n">IPv6linklocalallnodes</span><span class="p">,</span> <span class="n">ip</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span>  <span class="c">// the special IPv6 multicast address ff02::1</span>
	<span class="k">return</span> <span class="n">err</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">n</span> <span class="o">*</span><span class="n">ndpResponder</span><span class="p">)</span> <span class="n">advertise</span><span class="p">(</span><span class="n">dst</span><span class="p">,</span> <span class="n">target</span> <span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">,</span> <span class="n">gratuitous</span> <span class="kt">bool</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="n">m</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">ndp</span><span class="o">.</span><span class="n">NeighborAdvertisement</span><span class="p">{</span>
		<span class="n">Solicited</span><span class="o">:</span>     <span class="o">!</span><span class="n">gratuitous</span><span class="p">,</span>
		<span class="n">Override</span><span class="o">:</span>      <span class="n">gratuitous</span><span class="p">,</span>  <span class="c">// Should clients replace existing cache entries</span>
		<span class="n">TargetAddress</span><span class="o">:</span> <span class="n">target</span><span class="p">,</span>
		<span class="n">Options</span><span class="o">:</span> <span class="p">[]</span><span class="n">ndp</span><span class="o">.</span><span class="n">Option</span><span class="p">{</span>
			<span class="o">&amp;</span><span class="n">ndp</span><span class="o">.</span><span class="n">LinkLayerAddress</span><span class="p">{</span>
				<span class="n">Direction</span><span class="o">:</span> <span class="n">ndp</span><span class="o">.</span><span class="n">Target</span><span class="p">,</span>
				<span class="n">Addr</span><span class="o">:</span>      <span class="n">n</span><span class="o">.</span><span class="n">hardwareAddr</span><span class="p">,</span>
			<span class="p">},</span>
		<span class="p">},</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">n</span><span class="o">.</span><span class="n">conn</span><span class="o">.</span><span class="n">WriteTo</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="no">nil</span><span class="p">,</span> <span class="n">dst</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
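<p>NDPResp therefore has two more methods, <code class="language-plaintext highlighter-rouge">Watch</code> and <code class="language-plaintext highlighter-rouge">Unwatch</code>, which the Announcer calls from <code class="language-plaintext highlighter-rouge">SetBalancer</code> and <code class="language-plaintext highlighter-rouge">DeleteBalancer</code> respectively. As the code below shows, they make the announcing interface join or leave the solicited-node multicast group derived from the IP, so that Neighbor Solicitations for that IP reach the responder.</p>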
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">n</span> <span class="o">*</span><span class="n">ndpResponder</span><span class="p">)</span> <span class="n">Watch</span><span class="p">(</span><span class="n">ip</span> <span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="c">// ...</span>
	<span class="n">group</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">ndp</span><span class="o">.</span><span class="n">SolicitedNodeMulticast</span><span class="p">(</span><span class="n">ip</span><span class="p">)</span>

	<span class="k">if</span> <span class="n">n</span><span class="o">.</span><span class="n">solicitedNodeGroups</span><span class="p">[</span><span class="n">group</span><span class="o">.</span><span class="n">String</span><span class="p">()]</span> <span class="o">==</span> <span class="m">0</span> <span class="p">{</span>
		<span class="n">n</span><span class="o">.</span><span class="n">conn</span><span class="o">.</span><span class="n">JoinGroup</span><span class="p">(</span><span class="n">group</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="n">n</span><span class="o">.</span><span class="n">solicitedNodeGroups</span><span class="p">[</span><span class="n">group</span><span class="o">.</span><span class="n">String</span><span class="p">()]</span><span class="o">++</span>
	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">n</span> <span class="o">*</span><span class="n">ndpResponder</span><span class="p">)</span> <span class="n">Unwatch</span><span class="p">(</span><span class="n">ip</span> <span class="n">net</span><span class="o">.</span><span class="n">IP</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="c">// ...</span>
	<span class="n">group</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">ndp</span><span class="o">.</span><span class="n">SolicitedNodeMulticast</span><span class="p">(</span><span class="n">ip</span><span class="p">)</span>

	<span class="n">n</span><span class="o">.</span><span class="n">solicitedNodeGroups</span><span class="p">[</span><span class="n">group</span><span class="o">.</span><span class="n">String</span><span class="p">()]</span><span class="o">--</span>
	<span class="k">if</span> <span class="n">n</span><span class="o">.</span><span class="n">solicitedNodeGroups</span><span class="p">[</span><span class="n">group</span><span class="o">.</span><span class="n">String</span><span class="p">()]</span> <span class="o">==</span> <span class="m">0</span> <span class="p">{</span>
		<span class="n">n</span><span class="o">.</span><span class="n">conn</span><span class="o">.</span><span class="n">LeaveGroup</span><span class="p">(</span><span class="n">group</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
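<p>As a rough illustration of what <code class="language-plaintext highlighter-rouge">ndp.SolicitedNodeMulticast</code> computes, the sketch below derives a solicited-node multicast address with the standard library only: it starts from the fixed prefix <code class="language-plaintext highlighter-rouge">ff02::1:ff00:0/104</code> and copies in the low 24 bits of the IP (RFC 4291). The function name <code class="language-plaintext highlighter-rouge">solicitedNodeMulticast</code> and the sample address are my own, and this is not the library's implementation.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"fmt"
	"net"
)

// solicitedNodeMulticast returns ff02::1:ffXX:XXXX, where XX:XXXX are the
// low 24 bits of ip. It returns nil when ip is not an IPv6 address.
func solicitedNodeMulticast(ip net.IP) net.IP {
	if ip.To4() != nil {
		return nil // IPv4 (or IPv4-mapped) input has no solicited-node group
	}
	ip16 := ip.To16()
	if ip16 == nil {
		return nil
	}
	group := net.ParseIP("ff02::1:ff00:0") // 16-byte slice holding the fixed prefix
	copy(group[13:], ip16[13:])            // overwrite the last 3 bytes
	return group
}

func main() {
	fmt.Println(solicitedNodeMulticast(net.ParseIP("2001:db8::aa:bbcc")))
}
</code></pre></div></div>
<p>Because only the low 24 bits matter, different IPs can map to the same group, which is why <code class="language-plaintext highlighter-rouge">Watch</code> and <code class="language-plaintext highlighter-rouge">Unwatch</code> keep a per-group counter and only join or leave when it moves from or to zero.</p>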
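<p>Like ARPResp, NDPResp starts listening on its interface at initialization, and it processes requests in much the same way. NDPResp accepts only Neighbor Solicitation (NS) messages, and it replies with a unicast NA only when the target address of the message passes the same <code class="language-plaintext highlighter-rouge">announce</code> check, that is, when it is an advertised IP on a permitted interface.</p>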
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">n</span> <span class="o">*</span><span class="n">ndpResponder</span><span class="p">)</span> <span class="n">processRequest</span><span class="p">()</span> <span class="n">dropReason</span> <span class="p">{</span>
	<span class="n">msg</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">n</span><span class="o">.</span><span class="n">conn</span><span class="o">.</span><span class="n">ReadFrom</span><span class="p">()</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">dropReasonError</span>
	<span class="p">}</span>

	<span class="c">// Only handle NS messages</span>
	<span class="n">ns</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">msg</span><span class="o">.</span><span class="p">(</span><span class="o">*</span><span class="n">ndp</span><span class="o">.</span><span class="n">NeighborSolicitation</span><span class="p">)</span>
	<span class="k">if</span> <span class="o">!</span><span class="n">ok</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">dropReasonMessageType</span>
	<span class="p">}</span>

	<span class="c">// Extract the sender's source MAC address</span>
	<span class="k">var</span> <span class="n">nsLLAddr</span> <span class="n">net</span><span class="o">.</span><span class="n">HardwareAddr</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">o</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">ns</span><span class="o">.</span><span class="n">Options</span> <span class="p">{</span>
		<span class="n">lla</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">o</span><span class="o">.</span><span class="p">(</span><span class="o">*</span><span class="n">ndp</span><span class="o">.</span><span class="n">LinkLayerAddress</span><span class="p">)</span>
		<span class="k">if</span> <span class="o">!</span><span class="n">ok</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="k">if</span> <span class="n">lla</span><span class="o">.</span><span class="n">Direction</span> <span class="o">!=</span> <span class="n">ndp</span><span class="o">.</span><span class="n">Source</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="n">nsLLAddr</span> <span class="o">=</span> <span class="n">lla</span><span class="o">.</span><span class="n">Addr</span>
		<span class="k">break</span>
	<span class="p">}</span>
	<span class="k">if</span> <span class="n">nsLLAddr</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">dropReasonNoSourceLL</span>
	<span class="p">}</span>

	<span class="c">// announce is the same method as in the ARP Responder above</span>
	<span class="n">reason</span> <span class="o">:=</span> <span class="n">n</span><span class="o">.</span><span class="n">announce</span><span class="p">(</span><span class="n">ns</span><span class="o">.</span><span class="n">TargetAddress</span><span class="p">,</span> <span class="n">n</span><span class="o">.</span><span class="n">intf</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">reason</span> <span class="o">!=</span> <span class="n">dropReasonNone</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">reason</span>
	<span class="p">}</span>

	<span class="n">n</span><span class="o">.</span><span class="n">advertise</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">ns</span><span class="o">.</span><span class="n">TargetAddress</span><span class="p">,</span> <span class="no">false</span><span class="p">)</span>  <span class="c">// reply with an NA message to the unicast source address</span>
	<span class="c">// ...</span>
	<span class="k">return</span> <span class="n">dropReasonNone</span>
<span class="p">}</span>
</code></pre></div></div>
<h4 id="failover-机制">Failover Mechanism</h4>
<p>Failover of the leader speaker is automatic. MetalLB uses <a href="https://github.com/hashicorp/memberlist">memberlist</a> to detect failed Nodes. A detailed look at memberlist is outside the scope of this article.</p>

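<p>memberlist disseminates membership through a gossip protocol. Every speaker maintains a member list, the speakerlist. Specifically, because MetalLB uses memberlist's <code class="language-plaintext highlighter-rouge">DefaultLANConfig</code>, what memberlist maintains is <strong>the list of hostnames of the Nodes in the cluster</strong>. The speakerlist starts with the speaker process and runs three goroutines in the background. The first refreshes the list of speaker pod IPs periodically (every five minutes). The second listens for member join and leave events from memberlist and triggers a reload of the speaker controller (the same thing as writing an event to <code class="language-plaintext highlighter-rouge">reloadChan</code>, mentioned in <code class="language-plaintext highlighter-rouge">reconcileService</code> above). The third listens and periodically (every minute) tries to add the IPs of new members to the speaker pod IP list.</p>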
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/speakerlist/speakerlist.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">sl</span> <span class="o">*</span><span class="n">SpeakerList</span><span class="p">)</span> <span class="n">Start</span><span class="p">(</span><span class="n">client</span> <span class="o">*</span><span class="n">k8s</span><span class="o">.</span><span class="n">Client</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">sl</span><span class="o">.</span><span class="n">client</span> <span class="o">=</span> <span class="n">client</span>

	<span class="c">// Initialize the pod IP list, i.e. the IPs of the speaker pods in the metallb-system namespace</span>
	<span class="n">iplist</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">sl</span><span class="o">.</span><span class="n">mlSpeakers</span><span class="p">()</span>
	<span class="n">sl</span><span class="o">.</span><span class="n">mlMux</span><span class="o">.</span><span class="n">Lock</span><span class="p">()</span>
	<span class="n">sl</span><span class="o">.</span><span class="n">mlSpeakerIPs</span> <span class="o">=</span> <span class="n">iplist</span>
	<span class="n">sl</span><span class="o">.</span><span class="n">mlMux</span><span class="o">.</span><span class="n">Unlock</span><span class="p">()</span>

	<span class="k">go</span> <span class="n">sl</span><span class="o">.</span><span class="n">updateSpeakerIPs</span><span class="p">()</span>
	<span class="k">go</span> <span class="n">sl</span><span class="o">.</span><span class="n">memberlistWatchEvents</span><span class="p">()</span>
	<span class="k">go</span> <span class="n">sl</span><span class="o">.</span><span class="n">joinMembers</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">UsableSpeakers</code> method used during leader election simply calls the interface memberlist exposes to obtain the list of currently available Nodes.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">sl</span> <span class="o">*</span><span class="n">SpeakerList</span><span class="p">)</span> <span class="n">UsableSpeakers</span><span class="p">()</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">bool</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">sl</span><span class="o">.</span><span class="n">ml</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="no">nil</span>
	<span class="p">}</span>
	<span class="n">activeNodes</span> <span class="o">:=</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">bool</span><span class="p">{}</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">n</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">sl</span><span class="o">.</span><span class="n">ml</span><span class="o">.</span><span class="n">Members</span><span class="p">()</span> <span class="p">{</span>  <span class="c">// memberlist method</span>
		<span class="n">activeNodes</span><span class="p">[</span><span class="n">n</span><span class="o">.</span><span class="n">Name</span><span class="p">]</span> <span class="o">=</span> <span class="no">true</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">activeNodes</span>
<span class="p">}</span>
</code></pre></div></div>
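<p>In fact, the speaker's entire L2 mode is built on the failover mechanism. As the figure below shows, when the current leader speaker goes offline, memberlist delivers a <code class="language-plaintext highlighter-rouge">NodeLeave</code> event to every speaker. On receiving the event, each speaker forces (<code class="language-plaintext highlighter-rouge">forceReload</code>) a full Service reconciliation loop. The reconciliation returns to the leader election described above: every speaker sorts the Nodes by a hash built from the Node hostname and the Service LB IP. The ordering is identical on all speakers, and only the speaker whose own Node hostname comes first in the ordering is elected leader. Finally, the new leader sends GARP packets to all hosts on the subnet to update their ARP mappings.</p>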

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-06/metallb-failover.png" alt="metallb-failover" /></p>
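<p>The election step can be sketched in a few lines of Go. This is a minimal illustration rather than MetalLB's code: the function names <code class="language-plaintext highlighter-rouge">electLeader</code> and <code class="language-plaintext highlighter-rouge">without</code> are hypothetical, and the use of SHA-256 over <code class="language-plaintext highlighter-rouge">node + "#" + ip</code> is an assumption made for the example. The point is only that a deterministic sort lets every speaker reach the same result without any coordination.</p>

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

// electLeader returns the node that should announce lbIP, or "" when there
// are no usable nodes. Every speaker runs the same deterministic sort, so all
// of them agree on the winner without exchanging any messages.
// (Hypothetical helper; hash function and separator are assumptions.)
func electLeader(usableNodes []string, lbIP string) string {
	if len(usableNodes) == 0 {
		return ""
	}
	// Copy first so the caller's slice is not reordered.
	candidates := append([]string(nil), usableNodes...)
	sort.Slice(candidates, func(i, j int) bool {
		hi := sha256.Sum256([]byte(candidates[i] + "#" + lbIP))
		hj := sha256.Sum256([]byte(candidates[j] + "#" + lbIP))
		return bytes.Compare(hi[:], hj[:]) < 0
	})
	return candidates[0]
}

// without returns a copy of nodes with the named node removed,
// which simulates the effect of a NodeLeave event.
func without(nodes []string, name string) []string {
	out := make([]string, 0, len(nodes))
	for _, n := range nodes {
		if n != name {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	lbIP := "192.168.1.240"

	leader := electLeader(nodes, lbIP)
	fmt.Println("leader:", leader)

	// The leader goes away; the survivors re-run the same election.
	fmt.Println("new leader:", electLeader(without(nodes, leader), lbIP))
}
```

<p>Because the LB IP is part of the hash input, different Services tend to pick different leaders, which spreads the announcing role across Nodes.</p>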

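<p>In L2 mode, therefore, performance is limited by two things: <strong>the bandwidth of the Node hosting the leader speaker, and potentially slow failover</strong>. As for the latter, a complete and successful failover has to go through leader election, the GARP broadcast, and the neighbors refreshing their ARP caches, so it takes a few seconds (the official documentation states that it generally does not exceed 10s).</p>
<h3 id="bgp-模式">BGP mode</h3>
<p>In this mode, every speaker advertises the Service's LB IP to each (or each selected) BGP peer. A BGP peer here is any router that speaks BGP: either a dedicated hardware router or any device running routing software such as BIRD or Quagga. When the router receives traffic destined for the LB IP, it picks one of the Nodes whose speaker advertised that IP and forwards the traffic to it. Once the traffic reaches the Node, kube-proxy handles the rest of the forwarding, and <code class="language-plaintext highlighter-rouge">ExternalTrafficPolicy</code> has the same effect as described above.</p>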

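<p>Each time the router sees a new connection to the LB IP, it assigns that connection to one Node. Which Node is chosen depends on the vendor or the routing software, but the purpose of the selection algorithm is to spread traffic across Nodes, and this is <strong>where load balancing happens</strong> in MetalLB's BGP mode. If a Node becomes unavailable, the router selects another Node for the affected traffic, which is <strong>how failure recovery works</strong> in BGP mode.</p>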

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-06-06/metallb-bgp.png" alt="metallb-bgp.png" /></p>
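<p>How a router spreads connections over the advertising Nodes is vendor-specific, but a common approach is to hash each flow and use the hash to pick a next hop. The sketch below assumes a simple modulo scheme over an FNV hash; the type <code class="language-plaintext highlighter-rouge">flow</code> and the functions <code class="language-plaintext highlighter-rouge">pickNextHop</code> and <code class="language-plaintext highlighter-rouge">movedFlows</code> are hypothetical and do not describe any particular router.</p>

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// flow identifies a TCP connection by its addresses and ports.
type flow struct {
	srcIP   string
	srcPort uint16
	dstIP   string
	dstPort uint16
}

// pickNextHop maps a flow to one of the Nodes advertising the LB IP.
// All packets of one flow hash to the same value, so they reach the same Node
// for as long as the set of next hops does not change.
func pickNextHop(f flow, nextHops []string) string {
	if len(nextHops) == 0 {
		return ""
	}
	h := fnv.New32a()
	fmt.Fprintf(h, "%s:%d>%s:%d", f.srcIP, f.srcPort, f.dstIP, f.dstPort)
	return nextHops[h.Sum32()%uint32(len(nextHops))]
}

// movedFlows counts the flows whose Node differs between two next-hop sets.
func movedFlows(flows []flow, before, after []string) int {
	moved := 0
	for _, f := range flows {
		if pickNextHop(f, before) != pickNextHop(f, after) {
			moved++
		}
	}
	return moved
}

func main() {
	all := []string{"node-a", "node-b", "node-c"}
	fewer := []string{"node-a", "node-b"} // node-c stopped advertising

	var flows []flow
	for port := uint16(40000); port < 40100; port++ {
		flows = append(flows, flow{"203.0.113.7", port, "192.168.1.240", 443})
	}

	fmt.Println("first flow goes to:", pickNextHop(flows[0], all))
	fmt.Printf("%d of %d flows change Node when node-c leaves\n",
		movedFlows(flows, all, fewer), len(flows))
}
```

<p>With a plain modulo, removing one next hop can also move flows that were never served by the removed Node, which is why a change in the set of advertising Nodes may break connections that were otherwise healthy.</p>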

<p>MetalLB provides two implementations of BGP mode, <code class="language-plaintext highlighter-rouge">native</code> and <code class="language-plaintext highlighter-rouge">frr</code>, selected by the environment variable <code class="language-plaintext highlighter-rouge">METALLB_BGP_TYPE</code>. When the speaker creates its BGP controller, it initializes the session manager that corresponds to the selected type.</p>
<h4 id="native-实现">Native implementation</h4>
<h5 id="syncpeers">syncPeers</h5>
<p>Whenever the Node or the Config is updated, state is synchronized with the Router, that is, the BGP peer. This happens in the BGP controller's <code class="language-plaintext highlighter-rouge">SetNode</code> and <code class="language-plaintext highlighter-rouge">SetConfig</code> methods:</p>

<ul>
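  <li>A speaker runs on every Node, so creating, deleting, or updating (the labels of) a Node may cause connections to BGP peers to be established or torn down</li>
  <li>A BGP peer is described by the <code class="language-plaintext highlighter-rouge">BGPPeer</code> CRD, so adding a BGP peer to the cluster or removing one causes all speakers it applies to to establish or tear down their connection to that peer</li>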
</ul>

<p>Both methods detect these changes, and both end up calling <code class="language-plaintext highlighter-rouge">syncPeers</code> to synchronize state.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// speaker/bgp_controller.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">bgpController</span><span class="p">)</span> <span class="n">syncPeers</span><span class="p">(</span><span class="n">l</span> <span class="n">log</span><span class="o">.</span><span class="n">Logger</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="k">var</span> <span class="p">(</span>
		<span class="n">errs</span>          <span class="kt">int</span>
		<span class="n">needUpdateAds</span> <span class="kt">bool</span>
	<span class="p">)</span>

	<span class="c">// Iterate over all peers; this list reflects the latest configuration</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">p</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">c</span><span class="o">.</span><span class="n">peers</span> <span class="p">{</span>
		<span class="c">// Match the NodeSelectors of each peer to decide whether this Node should peer with it</span>
		<span class="n">shouldRun</span> <span class="o">:=</span> <span class="no">false</span>
		<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">cfg</span><span class="o">.</span><span class="n">NodeSelectors</span><span class="p">)</span> <span class="o">==</span> <span class="m">0</span> <span class="p">{</span>
			<span class="n">shouldRun</span> <span class="o">=</span> <span class="no">true</span>
		<span class="p">}</span>
		<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">ns</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">p</span><span class="o">.</span><span class="n">cfg</span><span class="o">.</span><span class="n">NodeSelectors</span> <span class="p">{</span>
			<span class="k">if</span> <span class="n">ns</span><span class="o">.</span><span class="n">Matches</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">nodeLabels</span><span class="p">)</span> <span class="p">{</span>
				<span class="n">shouldRun</span> <span class="o">=</span> <span class="no">true</span>
				<span class="k">break</span>
			<span class="p">}</span>
		<span class="p">}</span>

		<span class="c">// A session exists but this Node no longer matches: close the session</span>
		<span class="k">if</span> <span class="n">p</span><span class="o">.</span><span class="n">session</span> <span class="o">!=</span> <span class="no">nil</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">shouldRun</span> <span class="p">{</span>
			<span class="n">p</span><span class="o">.</span><span class="n">session</span><span class="o">.</span><span class="n">Close</span><span class="p">()</span>  <span class="c">// ---&gt;--- conn.Close()</span>
			<span class="n">p</span><span class="o">.</span><span class="n">session</span> <span class="o">=</span> <span class="no">nil</span>
		<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">p</span><span class="o">.</span><span class="n">session</span> <span class="o">==</span> <span class="no">nil</span> <span class="o">&amp;&amp;</span> <span class="n">shouldRun</span> <span class="p">{</span>
			<span class="c">// No session exists but this Node matches: create a new session</span>
			<span class="k">var</span> <span class="n">routerID</span> <span class="n">net</span><span class="o">.</span><span class="n">IP</span>
			<span class="k">if</span> <span class="n">p</span><span class="o">.</span><span class="n">cfg</span><span class="o">.</span><span class="n">RouterID</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
				<span class="n">routerID</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">cfg</span><span class="o">.</span><span class="n">RouterID</span>
			<span class="p">}</span>
			<span class="n">s</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">sessionManager</span><span class="o">.</span><span class="n">NewSession</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">logger</span><span class="p">,</span>  <span class="c">// create the session and try to connect</span>
				<span class="n">bgp</span><span class="o">.</span><span class="n">SessionParameters</span><span class="p">{</span>
					<span class="c">// ...</span>
				<span class="p">},</span>
			<span class="p">)</span>
			<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
				<span class="n">errs</span><span class="o">++</span>  <span class="c">// count the failure; the remaining peers are still processed</span>
			<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
				<span class="n">p</span><span class="o">.</span><span class="n">session</span> <span class="o">=</span> <span class="n">s</span>
				<span class="n">needUpdateAds</span> <span class="o">=</span> <span class="no">true</span>
			<span class="p">}</span>
		<span class="p">}</span>
	<span class="p">}</span>

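	<span class="c">// New sessions were created, so the advertisements must be sent again</span>
	<span class="k">if</span> <span class="n">needUpdateAds</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">updateAds</span><span class="p">();</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span> <span class="n">err</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="k">if</span> <span class="n">errs</span> <span class="o">&gt;</span> <span class="m">0</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"%d BGP sessions failed to start"</span><span class="p">,</span> <span class="n">errs</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="no">nil</span>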
<span class="p">}</span>
</code></pre></div></div>
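<p>Sessions are created through the session manager's <code class="language-plaintext highlighter-rouge">NewSession</code> method; the session manager itself is just an interface. Closing a session simply drops the connection. Note that when a BGP session terminates, <strong>it may disrupt other active connections</strong> (users may see <code class="language-plaintext highlighter-rouge">connection reset by peer</code>, for example), because the router may redistribute existing flows across the remaining Nodes. How severe this is depends on each Router's implementation, but it is an unavoidable consequence of MetalLB doing load balancing over BGP. Users who know their topology in advance can use NodeSelector to restrict which Nodes connect to which BGP peers and thereby limit the blast radius.</p>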
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/bgp/bgp.go</span>

<span class="k">type</span> <span class="n">SessionManager</span> <span class="k">interface</span> <span class="p">{</span>
	<span class="n">NewSession</span><span class="p">(</span><span class="n">logger</span> <span class="n">log</span><span class="o">.</span><span class="n">Logger</span><span class="p">,</span> <span class="n">args</span> <span class="n">SessionParameters</span><span class="p">)</span> <span class="p">(</span><span class="n">Session</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
	<span class="n">SyncBFDProfiles</span><span class="p">(</span><span class="n">profiles</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="o">*</span><span class="n">config</span><span class="o">.</span><span class="n">BFDProfile</span><span class="p">)</span> <span class="kt">error</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">NewSession</code> call here creates a <strong>Native session</strong>. Creating the session also starts two goroutines: one establishes the connection to the BGP peer, and the other, once the connection is up, periodically sends KEEPALIVE messages to the BGP peer (the interval is governed by <code class="language-plaintext highlighter-rouge">BGPPeer.spec.holdTime</code>). Note that although the speaker connects to the BGP peer over TCP, MetalLB does so in a fairly low-level way: <strong>directly through a socket</strong>. The reasons include:</p>

<ul>
  <li>It makes it easy to set the TCP MD5 signature option: <code class="language-plaintext highlighter-rouge">BGPPeer.spec.password</code> enables TCP MD5 authentication for the BGP session
    <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/bgp/native/native.go</span>

    <span class="k">if</span> <span class="n">password</span> <span class="o">!=</span> <span class="s">""</span> <span class="p">{</span>
        <span class="n">sig</span> <span class="o">:=</span> <span class="n">buildTCPMD5Sig</span><span class="p">(</span><span class="n">raddr</span><span class="o">.</span><span class="n">IP</span><span class="p">,</span> <span class="n">password</span><span class="p">)</span>
        <span class="n">b</span> <span class="o">:=</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="p">[</span><span class="n">unsafe</span><span class="o">.</span><span class="n">Sizeof</span><span class="p">(</span><span class="n">sig</span><span class="p">)]</span><span class="kt">byte</span><span class="p">)(</span><span class="n">unsafe</span><span class="o">.</span><span class="n">Pointer</span><span class="p">(</span><span class="o">&amp;</span><span class="n">sig</span><span class="p">))</span>
        <span class="c">// fd is a socket bound to the local address; if BGPPeer.spec.sourceAddress is not set, 0.0.0.0 (or [::] for IPv6) is used, meaning any available address</span>
        <span class="k">if</span> <span class="n">err</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">NewSyscallError</span><span class="p">(</span><span class="s">"setsockopt"</span><span class="p">,</span> <span class="n">unix</span><span class="o">.</span><span class="n">SetsockoptString</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">unix</span><span class="o">.</span><span class="n">IPPROTO_TCP</span><span class="p">,</span> <span class="n">tcpMD5SIG</span><span class="p">,</span> <span class="kt">string</span><span class="p">(</span><span class="n">b</span><span class="p">[</span><span class="o">:</span><span class="p">])));</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
            <span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div>    </div>
  </li>
  <li>It allows polling for the connection-established event with epoll, combined with a Context to enforce a timeout on connection setup
    <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="n">fi</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">NewFile</span><span class="p">(</span><span class="kt">uintptr</span><span class="p">(</span><span class="n">fd</span><span class="p">),</span> <span class="s">""</span><span class="p">)</span>

  <span class="n">epfd</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">EpollCreate1</span><span class="p">(</span><span class="n">syscall</span><span class="o">.</span><span class="n">EPOLL_CLOEXEC</span><span class="p">)</span>
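  <span class="k">var</span> <span class="n">event</span> <span class="n">unix</span><span class="o">.</span><span class="n">EpollEvent</span>
  <span class="n">events</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="n">unix</span><span class="o">.</span><span class="n">EpollEvent</span><span class="p">,</span> <span class="m">1</span><span class="p">)</span>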
  <span class="n">event</span><span class="o">.</span><span class="n">Events</span> <span class="o">=</span> <span class="n">syscall</span><span class="o">.</span><span class="n">EPOLLIN</span> <span class="o">|</span> <span class="n">syscall</span><span class="o">.</span><span class="n">EPOLLOUT</span> <span class="o">|</span> <span class="n">syscall</span><span class="o">.</span><span class="n">EPOLLPRI</span>
  <span class="n">event</span><span class="o">.</span><span class="n">Fd</span> <span class="o">=</span> <span class="kt">int32</span><span class="p">(</span><span class="n">fd</span><span class="p">)</span>
  <span class="n">unix</span><span class="o">.</span><span class="n">EpollCtl</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">syscall</span><span class="o">.</span><span class="n">EPOLL_CTL_ADD</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">)</span>

  <span class="k">for</span> <span class="p">{</span>
    <span class="n">timeout</span> <span class="o">:=</span> <span class="kt">int</span><span class="p">(</span><span class="o">-</span><span class="m">1</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">deadline</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">ctx</span><span class="o">.</span><span class="n">Deadline</span><span class="p">();</span> <span class="n">ok</span> <span class="p">{</span>
          <span class="c">// derive the epoll timeout from the deadline</span>
    <span class="p">}</span>
    <span class="n">nevents</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">EpollWait</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">events</span><span class="p">,</span> <span class="n">timeout</span><span class="p">)</span>
    <span class="n">nerr</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">GetsockoptInt</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">unix</span><span class="o">.</span><span class="n">SOL_SOCKET</span><span class="p">,</span> <span class="n">unix</span><span class="o">.</span><span class="n">SO_ERROR</span><span class="p">)</span>

    <span class="c">// inspect the socket state; once the connection is established, return net.FileConn(fi)</span>
  <span class="p">}</span>
</code></pre></div>    </div>
  </li>
</ul>

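<p>The BGP protocol requires that, once the connection is established, each side sends an OPEN message (bgp_hdr_type=1); if that message is accepted, each side replies with a KEEPALIVE message (bgp_hdr_type=4). MetalLB does this immediately after the connection is established, and then starts a <code class="language-plaintext highlighter-rouge">consumeBGP</code> goroutine that consumes the messages sent by the BGP peer (it only receives and never replies). At this point the connection between the Node and the BGP peer is up and the session is open.</p>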
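<p>The message types mentioned here all share the same fixed header defined by BGP-4: a 16-byte marker of all ones, a 2-byte total length, and a 1-byte type. The sketch below encodes and decodes that header to make the numbers concrete. The function and constant names are made up for this example and are not MetalLB's; a KEEPALIVE is simply a header with no body.</p>

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	markerLen     = 16   // marker field, all bits set
	headerLen     = 19   // marker + 2-byte length + 1-byte type
	maxMessageLen = 4096 // upper bound for a classic BGP-4 message

	msgOpen         = 1
	msgUpdate       = 2
	msgNotification = 3
	msgKeepalive    = 4
)

// encodeMessage prepends the BGP header to body.
func encodeMessage(msgType uint8, body []byte) ([]byte, error) {
	total := headerLen + len(body)
	if total > maxMessageLen {
		return nil, errors.New("bgp: message too long")
	}
	msg := make([]byte, total)
	for i := 0; i < markerLen; i++ {
		msg[i] = 0xff
	}
	binary.BigEndian.PutUint16(msg[16:18], uint16(total))
	msg[18] = msgType
	copy(msg[headerLen:], body)
	return msg, nil
}

// decodeHeader validates the header and returns the message type and length.
func decodeHeader(msg []byte) (msgType uint8, length uint16, err error) {
	if len(msg) < headerLen {
		return 0, 0, errors.New("bgp: short header")
	}
	for _, b := range msg[:markerLen] {
		if b != 0xff {
			return 0, 0, errors.New("bgp: bad marker")
		}
	}
	length = binary.BigEndian.Uint16(msg[16:18])
	if length < headerLen || length > maxMessageLen {
		return 0, 0, errors.New("bgp: bad length")
	}
	return msg[18], length, nil
}

func main() {
	keepalive, err := encodeMessage(msgKeepalive, nil)
	if err != nil {
		panic(err)
	}
	msgType, length, err := decodeHeader(keepalive)
	fmt.Println("type:", msgType, "length:", length, "error:", err)
}
```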
<h5 id="updateads">updateAds</h5>
<p>As described above, <code class="language-plaintext highlighter-rouge">syncPeers</code> calls <code class="language-plaintext highlighter-rouge">updateAds</code> at the end to advertise the LB IPs whenever the sync created a new session. In addition, the same method is used to advertise whenever a Service resource changes.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// speaker/bgp_controller.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">bgpController</span><span class="p">)</span> <span class="n">updateAds</span><span class="p">()</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="k">var</span> <span class="n">allAds</span> <span class="p">[]</span><span class="o">*</span><span class="n">bgp</span><span class="o">.</span><span class="n">Advertisement</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">ads</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">c</span><span class="o">.</span><span class="n">svcAds</span> <span class="p">{</span>
		<span class="n">allAds</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">allAds</span><span class="p">,</span> <span class="n">ads</span><span class="o">...</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">peer</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">c</span><span class="o">.</span><span class="n">peers</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">peer</span><span class="o">.</span><span class="n">session</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="c">// Advertise the IPs to every peer that has an established session</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">peer</span><span class="o">.</span><span class="n">session</span><span class="o">.</span><span class="n">Set</span><span class="p">(</span><span class="n">allAds</span><span class="o">...</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span> <span class="n">err</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Service changes are observed through the speaker controller's <code class="language-plaintext highlighter-rouge">SetBalancer</code> method, and then go through the same steps as in L2 mode:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">bgpController.ShouldAnnounce</code> decides whether this Node should advertise, based on whether the address pool applies to the Node and on the <code class="language-plaintext highlighter-rouge">ExternalTrafficPolicy</code></li>
  <li><code class="language-plaintext highlighter-rouge">bgpController.SetBalancer</code> iterates over each LB IP of the Service and creates a <code class="language-plaintext highlighter-rouge">bgp.Advertisement</code> for it, a structure that records the peers an IP should be advertised to</li>
</ul>

<p><code class="language-plaintext highlighter-rouge">updateAds</code> hands the advertisements for all LB IPs to every peer with an established session. Many of those IPs are not meant for a given peer, so each peer's session filters them:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/bgp/native/native.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">session</span><span class="p">)</span> <span class="n">Set</span><span class="p">(</span><span class="n">advs</span> <span class="o">...*</span><span class="n">bgp</span><span class="o">.</span><span class="n">Advertisement</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="n">s</span><span class="o">.</span><span class="n">mu</span><span class="o">.</span><span class="n">Lock</span><span class="p">()</span>
	<span class="k">defer</span> <span class="n">s</span><span class="o">.</span><span class="n">mu</span><span class="o">.</span><span class="n">Unlock</span><span class="p">()</span>

	<span class="n">newAdvs</span> <span class="o">:=</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="o">*</span><span class="n">bgp</span><span class="o">.</span><span class="n">Advertisement</span><span class="p">{}</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">adv</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">advs</span> <span class="p">{</span>
		<span class="c">// Check whether this session's peer is among the peers of the advertisement; skip the advertisement if not</span>
		<span class="k">if</span> <span class="o">!</span><span class="n">adv</span><span class="o">.</span><span class="n">MatchesPeer</span><span class="p">(</span><span class="n">s</span><span class="o">.</span><span class="n">SessionName</span><span class="p">)</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="c">// Currently only IPv4 addresses can be advertised</span>
		<span class="n">err</span> <span class="o">:=</span> <span class="n">validate</span><span class="p">(</span><span class="n">adv</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span> <span class="n">err</span>
		<span class="p">}</span>
		<span class="n">newAdvs</span><span class="p">[</span><span class="n">adv</span><span class="o">.</span><span class="n">Prefix</span><span class="o">.</span><span class="n">String</span><span class="p">()]</span> <span class="o">=</span> <span class="n">adv</span>
	<span class="p">}</span>

	<span class="n">s</span><span class="o">.</span><span class="nb">new</span> <span class="o">=</span> <span class="n">newAdvs</span>
	<span class="n">s</span><span class="o">.</span><span class="n">cond</span><span class="o">.</span><span class="n">Broadcast</span><span class="p">()</span>

	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
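<p>The final <code class="language-plaintext highlighter-rouge">cond.Broadcast()</code> on the condition variable wakes up the sending goroutine, which uses <code class="language-plaintext highlighter-rouge">sendUpdate</code> or <code class="language-plaintext highlighter-rouge">sendWithdraw</code> to send BGP <a href="https://datatracker.ietf.org/doc/html/rfc1654#section-4.3">UPDATE messages</a> (bgp_hdr_type=2) that carry the routes of the LB IPs to add or remove.</p>
<h4 id="frr-实现">FRR implementation</h4>
<p>Besides the Native implementation above, MetalLB also supports an FRR-based implementation. <a href="https://frrouting.org/">FRR</a> is a powerful open-source routing suite for Linux that supports many routing protocols, and MetalLB uses its BGP implementation. With FRR mode enabled, BGP sessions gain BFD and IPv6 support. FRR itself also implements other routing protocols (such as RIP and OSPF), although the code covered in this article only drives its BGP and BFD features.</p>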
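<p>Deciding which prefixes need an UPDATE and which need a withdrawal is essentially a diff between what has already been advertised and the desired set stored by <code class="language-plaintext highlighter-rouge">Set</code>. The sketch below shows one way such a diff could look. It is an illustration under assumptions: the <code class="language-plaintext highlighter-rouge">advertisement</code> struct is a simplified stand-in for <code class="language-plaintext highlighter-rouge">bgp.Advertisement</code>, and <code class="language-plaintext highlighter-rouge">diffAdvertisements</code> is a hypothetical name.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// advertisement is a simplified stand-in for bgp.Advertisement.
type advertisement struct {
	Prefix    string
	LocalPref uint32
}

// diffAdvertisements compares what was already advertised with what is wanted.
// Prefixes that are new or whose attributes changed need an UPDATE; prefixes
// that disappeared need to be withdrawn. Results are sorted for stable output.
func diffAdvertisements(advertised, wanted map[string]advertisement) (updates, withdraws []string) {
	for prefix, adv := range wanted {
		if old, ok := advertised[prefix]; ok && old == adv {
			continue // unchanged, nothing to send
		}
		updates = append(updates, prefix)
	}
	for prefix := range advertised {
		if _, ok := wanted[prefix]; !ok {
			withdraws = append(withdraws, prefix)
		}
	}
	sort.Strings(updates)
	sort.Strings(withdraws)
	return updates, withdraws
}

func main() {
	advertised := map[string]advertisement{
		"10.0.0.1/32": {Prefix: "10.0.0.1/32", LocalPref: 100},
		"10.0.0.2/32": {Prefix: "10.0.0.2/32", LocalPref: 100},
	}
	wanted := map[string]advertisement{
		"10.0.0.2/32": {Prefix: "10.0.0.2/32", LocalPref: 200}, // attribute changed
		"10.0.0.3/32": {Prefix: "10.0.0.3/32", LocalPref: 100}, // newly added
	}

	updates, withdraws := diffAdvertisements(advertised, wanted)
	fmt.Println("UPDATE:  ", updates)
	fmt.Println("WITHDRAW:", withdraws)
}
```

<p>Sending only the difference keeps the number of UPDATE messages proportional to what actually changed rather than to the total number of LB IPs.</p>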
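<h5 id="配合方式">How they work together</h5>
<p>In terms of implementation, FRR runs as an additional container in the speaker Pod. The speaker container controls the FRR container by writing a configuration file. The frr session manager generates the file content from the BGP configuration (see the <code class="language-plaintext highlighter-rouge">createConfig</code> method) and sends the generated configuration to the manager's <code class="language-plaintext highlighter-rouge">reloadConfig</code> channel. At the other end of the channel, a goroutine reads the configuration and writes it to the file. Many events push a configuration into the channel, including every session creation and close as well as every IP advertisement, so file I/O can become a performance bottleneck in FRR mode. To avoid this, MetalLB, much like Istio, <strong>uses a debounce mechanism</strong>: a new configuration is not written to the file immediately; instead MetalLB waits 3s (not configurable) and collapses all configurations received during that window into a single write.</p>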
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/bgp/frr/config.go</span>

<span class="k">func</span> <span class="n">debouncer</span><span class="p">(</span><span class="n">body</span> <span class="k">func</span><span class="p">(</span><span class="n">config</span> <span class="o">*</span><span class="n">frrConfig</span><span class="p">)</span> <span class="kt">error</span><span class="p">,</span> <span class="n">reload</span> <span class="o">&lt;-</span><span class="k">chan</span> <span class="n">reloadEvent</span><span class="p">,</span> <span class="n">reloadInterval</span> <span class="n">time</span><span class="o">.</span><span class="n">Duration</span><span class="p">,</span> <span class="n">failureRetryInterval</span> <span class="n">time</span><span class="o">.</span><span class="n">Duration</span><span class="p">,</span> <span class="n">l</span> <span class="n">log</span><span class="o">.</span><span class="n">Logger</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">go</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
		<span class="k">var</span> <span class="n">config</span> <span class="o">*</span><span class="n">frrConfig</span>
		<span class="k">var</span> <span class="n">timeOut</span> <span class="o">&lt;-</span><span class="k">chan</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span>
		<span class="n">timerSet</span> <span class="o">:=</span> <span class="no">false</span>
		<span class="k">for</span> <span class="p">{</span>
			<span class="k">select</span> <span class="p">{</span>
			<span class="k">case</span> <span class="n">newCfg</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="o">&lt;-</span><span class="n">reload</span><span class="o">:</span>
				<span class="k">if</span> <span class="o">!</span><span class="n">ok</span> <span class="p">{</span> <span class="c">// the channel was closed</span>
					<span class="k">return</span>
				<span class="p">}</span>
				<span class="k">if</span> <span class="n">newCfg</span><span class="o">.</span><span class="n">useOld</span> <span class="o">&amp;&amp;</span> <span class="n">config</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>  <span class="c">// useOld is set by the periodic config validation; it is true when something is wrong with the config</span>
					<span class="k">continue</span>
				<span class="p">}</span>
				<span class="k">if</span> <span class="o">!</span><span class="n">newCfg</span><span class="o">.</span><span class="n">useOld</span> <span class="o">&amp;&amp;</span> <span class="n">reflect</span><span class="o">.</span><span class="n">DeepEqual</span><span class="p">(</span><span class="n">newCfg</span><span class="o">.</span><span class="n">config</span><span class="p">,</span> <span class="n">config</span><span class="p">)</span> <span class="p">{</span>  <span class="c">// ignore requests that do not change the config</span>
					<span class="k">continue</span>
				<span class="p">}</span>
				<span class="k">if</span> <span class="o">!</span><span class="n">newCfg</span><span class="o">.</span><span class="n">useOld</span> <span class="p">{</span>
					<span class="n">config</span> <span class="o">=</span> <span class="n">newCfg</span><span class="o">.</span><span class="n">config</span>  <span class="c">// the "compression" is blunt: simply keep the most recent config received in the window</span>
				<span class="p">}</span>
				<span class="k">if</span> <span class="o">!</span><span class="n">timerSet</span> <span class="p">{</span>  <span class="c">// start the wait timer</span>
					<span class="n">timeOut</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">After</span><span class="p">(</span><span class="n">reloadInterval</span><span class="p">)</span>
					<span class="n">timerSet</span> <span class="o">=</span> <span class="no">true</span>
				<span class="p">}</span>
			<span class="k">case</span> <span class="o">&lt;-</span><span class="n">timeOut</span><span class="o">:</span>
				<span class="n">err</span> <span class="o">:=</span> <span class="n">body</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>  <span class="c">// write the FRR configuration file</span>
				<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>  <span class="c">// retry on error</span>
					<span class="n">timeOut</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">After</span><span class="p">(</span><span class="n">failureRetryInterval</span><span class="p">)</span>  <span class="c">// the retry interval is 5s and not configurable</span>
					<span class="n">timerSet</span> <span class="o">=</span> <span class="no">true</span>
					<span class="k">continue</span>
				<span class="p">}</span>
				<span class="n">timerSet</span> <span class="o">=</span> <span class="no">false</span>
			<span class="p">}</span>
		<span class="p">}</span>
	<span class="p">}()</span>
<span class="p">}</span>
</code></pre></div></div>
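<p>Once the configuration file has been written successfully, all BGP capabilities (load balancing, failover, and so on) are handed over to FRR. How FRR implements BGP is outside the scope of this article; if you are interested, <a href="http://docs.frrouting.org/en/latest/bgp.html">see this documentation</a>.</p>
<h5 id="快速故障检测">Fast failure detection</h5>
<p>Another benefit of FRR mode is that BFD can be used on BGP sessions. In the Native implementation, <code class="language-plaintext highlighter-rouge">holdTime</code> determines how long a session to a failed peer is kept before it is declared dead: the smaller the value, the faster failures are detected. The minimum allowed value is 3s, however, which is still too long for scenarios that depend on very fast detection. BFD provides fast, bidirectional failure detection and can <strong>bring the detection time down to the sub-second range</strong>.</p>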

<p>MetalLB uses the BFD implementation provided by FRR and exposes its settings through a <code class="language-plaintext highlighter-rouge">BFDProfile</code> CR. With the FRR implementation enabled, the bgp controller not only triggers <code class="language-plaintext highlighter-rouge">syncPeers</code> to synchronize state but also calls <code class="language-plaintext highlighter-rouge">syncBFDProfiles</code> to translate each <code class="language-plaintext highlighter-rouge">BFDProfile</code> into FRR configuration:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// internal/bgp/frr/frr.go</span>

<span class="k">func</span> <span class="p">(</span><span class="n">sm</span> <span class="o">*</span><span class="n">sessionManager</span><span class="p">)</span> <span class="n">SyncBFDProfiles</span><span class="p">(</span><span class="n">profiles</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="o">*</span><span class="n">metallbconfig</span><span class="o">.</span><span class="n">BFDProfile</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="n">sm</span><span class="o">.</span><span class="n">Lock</span><span class="p">()</span>
	<span class="k">defer</span> <span class="n">sm</span><span class="o">.</span><span class="n">Unlock</span><span class="p">()</span>
	<span class="n">sm</span><span class="o">.</span><span class="n">bfdProfiles</span> <span class="o">=</span> <span class="nb">make</span><span class="p">([]</span><span class="n">BFDProfile</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">p</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">profiles</span> <span class="p">{</span>
		<span class="n">frrProfile</span> <span class="o">:=</span> <span class="n">configBFDProfileToFRR</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>  <span class="c">// translate the CR into FRR configuration</span>
		<span class="n">sm</span><span class="o">.</span><span class="n">bfdProfiles</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">sm</span><span class="o">.</span><span class="n">bfdProfiles</span><span class="p">,</span> <span class="o">*</span><span class="n">frrProfile</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="n">sort</span><span class="o">.</span><span class="n">Slice</span><span class="p">(</span><span class="n">sm</span><span class="o">.</span><span class="n">bfdProfiles</span><span class="p">,</span> <span class="k">func</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="kt">int</span><span class="p">)</span> <span class="kt">bool</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">sm</span><span class="o">.</span><span class="n">bfdProfiles</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">Name</span> <span class="o">&lt;</span> <span class="n">sm</span><span class="o">.</span><span class="n">bfdProfiles</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="o">.</span><span class="n">Name</span>
	<span class="p">})</span>

	<span class="n">frrConfig</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">sm</span><span class="o">.</span><span class="n">createConfig</span><span class="p">()</span>  <span class="c">// generate an up-to-date config from the manager's current state</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">err</span>
	<span class="p">}</span>
	<span class="n">sm</span><span class="o">.</span><span class="n">reloadConfig</span> <span class="o">&lt;-</span> <span class="n">reloadEvent</span><span class="p">{</span><span class="n">config</span><span class="o">:</span> <span class="n">frrConfig</span><span class="p">}</span>  <span class="c">// send to the reload channel; the config is written out afterwards</span>
	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
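<h2 id="总结">Summary</h2>
<p>Both MetalLB components, the controller and the speaker, are standard K8s controller implementations. The controller component handles address allocation: it assigns External IPs to Service resources and reclaims them. In my view, <strong>multi-tenant address pools</strong> and <strong>IP address sharing</strong> are the two features that best show how flexible MetalLB's address management is, although they undeniably add to the code's complexity. Also, judging from the Allocator implementation in the controller component, nearly every public method is idempotent, which is an important and robust property wherever data has to be validated or updated frequently.</p>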

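<p>External announcement is handled by the speaker component, which covers both Layer 2 (ARP and NDP) and Layer 3 (BGP) protocols. Interestingly, <strong>MetalLB, although called a load balancer, does not implement load balancing itself</strong>. In L2 mode it makes the LB IP highly available through failover, and the actual load balancing is still done by kube-proxy; in L3 mode, load balancing is left to the BGP routing software. So rather than calling MetalLB a load balancer, it is more accurate to say that it acts as the "glue" between these protocols.</p>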

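<p>MetalLB can be deployed directly on bare-metal K8s clusters. It was first developed at Google in 2017 and <a href="https://github.com/cncf/toc/issues/720">became a CNCF Sandbox project</a> in 2021. As this article shows, there is nothing mysterious about MetalLB itself; what deserves deeper study is the set of network protocols it relies on, which this article only touches on.</p>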

<h2 id="reference">Reference</h2>

<ol>
  <li><a href="https://metallb.universe.tf/">https://metallb.universe.tf/</a></li>
  <li><a href="https://github.com/metallb/metallb/blob/main/design/pool-configuration.md">https://github.com/metallb/metallb/blob/main/design/pool-configuration.md</a></li>
  <li><a href="https://github.com/metallb/metallb/blob/main/design/layer2-bind-interfaces.md">https://github.com/metallb/metallb/blob/main/design/layer2-bind-interfaces.md</a></li>
  <li><a href="https://github.com/metallb/metallb/blob/main/design/0001-frr.md">https://github.com/metallb/metallb/blob/main/design/0001-frr.md</a></li>
  <li><a href="https://github.com/metallb/metallb/blob/main/design/bgp-bfd.md">https://github.com/metallb/metallb/blob/main/design/bgp-bfd.md</a></li>
  <li><a href="https://www.practicalnetworking.net/series/arp/gratuitous-arp/">https://www.practicalnetworking.net/series/arp/gratuitous-arp/</a></li>
  <li><a href="https://datatracker.ietf.org/doc/html/rfc5227#section-3">https://datatracker.ietf.org/doc/html/rfc5227#section-3</a></li>
  <li><a href="https://datatracker.ietf.org/doc/html/rfc1654">https://datatracker.ietf.org/doc/html/rfc1654</a></li>
  <li><a href="https://datatracker.ietf.org/doc/html/rfc5880">https://datatracker.ietf.org/doc/html/rfc5880</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Address_Resolution_Protocol#ARP_announcements">https://en.wikipedia.org/wiki/Address_Resolution_Protocol#ARP_announcements</a></li>
  <li><a href="http://linux-ip.net/html/ether-arp.html#ex-ether-arp-gratuitous">http://linux-ip.net/html/ether-arp.html#ex-ether-arp-gratuitous</a></li>
  <li><a href="https://www.networkacademy.io/ccna/ipv6/neighbor-discovery-protocol">https://www.networkacademy.io/ccna/ipv6/neighbor-discovery-protocol</a></li>
  <li><a href="https://cloud.redhat.com/blog/metallb-in-bgp-mode">https://cloud.redhat.com/blog/metallb-in-bgp-mode</a></li>
  <li><a href="https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/html/networking/load-balancing-with-metallb">https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/html/networking/load-balancing-with-metallb</a></li>
</ol>]]></content><author><name>Your Name</name><email>shawnhxh@outlook.com</email></author><category term="post" /><category term="Network" /><category term="Kubernetes" /><summary type="html"><![CDATA[The code in this article is based on MetalLB v0.13.9. MetalLB is a load balancer for bare-metal k8s clusters, built on standard routing protocols. "Bare-metal" here means that a directly deployed k8s cluster cannot use Services of type LoadBalancer, because k8s does not ship a load balancer implementation; such Services only work on cloud IaaS platforms (for example AWS or GCP). MetalLB implements such a load balancer through two functions: Address Allocation and External Announcement. Address Allocation Cloud providers assign an IP address to each request made to the load balancer; MetalLB does the same in a bare-metal cluster. The IP address is taken from a pre-configured address pool (AddressPool), and when a Service is deleted, MetalLB reclaims the address. Core method reconcileService This is the reconcile method of the service-controller, located in MetalLB's controller component. It watches Services of all types and manages their IP addresses (allocation or reclamation). // internal/k8s/controllers/service_controller.go func (r *ServiceReconciler) reconcileService(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { // ... var service *v1.Service // Look up the Service object by the NamespacedName provided by the Endpoint service, err := r.serviceFor(ctx, req.NamespacedName) if err != nil { \ return ctrl.Result{}, err \ } --&gt;-- r.Get(ctx, name, &amp;res) // If MetalLB's configuration specifies a LoadBalancerClass, compare it with the Service's // Only a match, or no class configured, passes; by default the configuration does not set this field if filterByLoadBalancerClass(service, r.LoadBalancerClass) { return ctrl.Result{}, nil } // Fetch the Endpoints or EndpointSlices backing the Service epSlices, err := epsOrSlicesForServices(ctx, r, req.NamespacedName, r.Endpoints) if err != nil { return ctrl.Result{}, err } // Whether the Service is nil tells us if this reconcile is a deletion or an update // Handle the Service, including IP address allocation and reclamation res := r.Handler(r.Logger, req.NamespacedName.String(), service, epSlices) switch res { case SyncStateError: return ctrl.Result{}, retryError case SyncStateReprocessAll: // Trigger a full re-reconcile r.forceReload() return ctrl.Result{}, nil case SyncStateErrorNoRetry: return ctrl.Result{}, nil } return ctrl.Result{}, nil }]]></summary></entry><entry><title type="html">Internal Listener in Envoy</title><link 
href="https://shawnh2.github.io/post/2023/05/25/envoy-internal-listener.html" rel="alternate" type="text/html" title="Internal Listener in Envoy" /><published>2023-05-25T00:00:00+08:00</published><updated>2023-05-25T00:00:00+08:00</updated><id>https://shawnh2.github.io/post/2023/05/25/envoy-internal-listener</id><content type="html" xml:base="https://shawnh2.github.io/post/2023/05/25/envoy-internal-listener.html"><![CDATA[<p>Envoy supports user-space sockets, and a listener that accepts user-space connections is called an internal listener. An internal listener typically accepts connections that originate inside Envoy, for example a connection request from an upstream cluster, over which a TCP stream is then established. To use an internal listener, its name must be set as the endpoint address of an upstream cluster.</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-05-25/envoy-il-base.png" alt="envoy-il-base" /></p>

<!--more-->

<p>The Envoy configuration must also enable internal listeners under <code class="language-plaintext highlighter-rouge">bootstrap_extensions</code>:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">bootstrap_extensions</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">envoy.bootstrap.internal_listener</span>
  <span class="na">typed_config</span><span class="pi">:</span>
    <span class="s2">"</span><span class="s">@type"</span><span class="err">:</span> <span class="s">type.googleapis.com/envoy.extensions.bootstrap.internal_listener.v3.InternalListener</span>
</code></pre></div></div>
<p>When several endpoints in the same upstream cluster reference the same internal listener, they can be told apart by setting the <code class="language-plaintext highlighter-rouge">clusters[i].load_assignment.endpoints[j].lb_endpoints[k].endpoint.address.envoy_internal_address.endpoint_id</code> field. The combination of this field and the internal listener name uniquely identifies an endpoint.</p>
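<p>As a minimal sketch, a cluster with two endpoints that point at the same internal listener could look like the fragment below. The cluster name, listener name, and endpoint IDs are hypothetical and chosen only for illustration.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>clusters:
- name: internal_cluster                  # hypothetical cluster name
  load_assignment:
    cluster_name: internal_cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            envoy_internal_address:
              server_listener_name: my_internal_listener  # hypothetical internal listener name
              endpoint_id: endpoint-a                     # distinguishes this endpoint
      - endpoint:
          address:
            envoy_internal_address:
              server_listener_name: my_internal_listener  # same listener as above
              endpoint_id: endpoint-b                     # different ID, so a distinct endpoint
</code></pre></div></div>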

<h2 id="chaining-proxies">Chaining proxies</h2>
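<p><a href="https://github.com/envoyproxy/envoy/blob/c2ae2211196a48b12d2e36d00c6c2889ae2f434a/configs/internal_listener_proxy.yaml">Envoy ships an example</a> that chains two TCP proxies inside one Envoy through an internal listener, so that connections are forwarded to a different port. As shown below, TCP connections on port 9999 are forwarded to port 10000.</p>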

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-05-25/envoy-il-chain-proxy.png" alt="envoy-il-chain-proxy" /></p>
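<p>The chain in the figure can be sketched roughly as follows. This is a simplified illustration rather than a copy of the linked example: all listener, cluster, and stat names are assumptions, the backend address 127.0.0.1 is assumed, and the <code class="language-plaintext highlighter-rouge">bootstrap_extensions</code> entry shown earlier is assumed to be present.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static_resources:
  listeners:
  # First TCP proxy: accepts real TCP connections on port 9999.
  - name: ingress                         # hypothetical name
    address:
      socket_address: {address: 0.0.0.0, port_value: 9999}
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: ingress_tcp
          cluster: to_internal_listener
  # Second TCP proxy: lives behind the internal listener.
  - name: chained_listener                # hypothetical name
    internal_listener: {}
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: chained_tcp
          cluster: backend
  clusters:
  # The endpoint address is the internal listener's name, not an IP and port.
  - name: to_internal_listener
    load_assignment:
      cluster_name: to_internal_listener
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              envoy_internal_address:
                server_listener_name: chained_listener
  # The real destination on port 10000.
  - name: backend
    load_assignment:
      cluster_name: backend
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: {address: 127.0.0.1, port_value: 10000}
</code></pre></div></div>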
<h2 id="encapsulate-http-get-in-connect">Encapsulate HTTP GET in CONNECT</h2>
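<p>One reason Envoy introduced the internal listener is that the HTTP connection manager (HCM) cannot proxy an HTTP GET request inside an upstream HTTP CONNECT request; that is, it cannot directly forward a downstream HTTP request to the upstream through HTTP CONNECT. An internal listener is therefore needed as an intermediary.</p>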

<p>Envoy also provides <a href="https://github.com/envoyproxy/envoy/blob/c2ae2211196a48b12d2e36d00c6c2889ae2f434a/configs/encapsulate_http_in_http2_connect.yaml">an example</a> for this. As shown below, every HTTP request arriving on port 10000 is encapsulated in an HTTP CONNECT request and sent to upstream port 10001. The TcpProxy in the internal listener is configured with <code class="language-plaintext highlighter-rouge">tunneling_config</code>, which tells the TcpProxy to establish an HTTP tunnel to the upstream; the specific protocol used by the tunnel is determined by the upstream cluster.</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-05-25/envoy-il-encap.png" alt="envoy-il-encap" /></p>

<blockquote>
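  <p>In the figure above, the Endpoint is drawn outside Envoy to indicate that it is a real Endpoint that can serve traffic. In the earlier figures, the Endpoint is drawn inside Envoy only to indicate that it is a field of the cluster.</p>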
</blockquote>

<p>Establishing a CONNECT tunnel through an internal listener in this way effectively makes the internal listener the client side of the tunnel.</p>
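<p>The two pieces that matter on the encapsulating side can be sketched as below: the internal listener's TcpProxy with <code class="language-plaintext highlighter-rouge">tunneling_config</code>, and an upstream cluster that selects HTTP/2 for the tunnel. This is a hedged fragment, not the linked example verbatim: the listener and cluster names, the tunnel <code class="language-plaintext highlighter-rouge">hostname</code>, and the upstream address 127.0.0.1 are assumptions.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static_resources:
  listeners:
  - name: encap_listener                  # hypothetical internal listener name
    internal_listener: {}
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: encap_tcp
          cluster: tunnel_upstream
          # Wrap the TCP stream in an HTTP CONNECT request to the upstream.
          tunneling_config:
            hostname: host.example:443    # assumed value for the CONNECT authority
  clusters:
  - name: tunnel_upstream                 # hypothetical cluster name
    # The upstream cluster decides the tunnel protocol; here HTTP/2.
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    load_assignment:
      cluster_name: tunnel_upstream
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: {address: 127.0.0.1, port_value: 10001}
</code></pre></div></div>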
<h2 id="decapsulate-http-connect">Decapsulate HTTP CONNECT</h2>
<p>Mirroring the example above, parsing the GET inside a GET-in-CONNECT request also takes two HCMs: one extracts the TCP stream from the CONNECT request and redirects it to the other, which parses the GET request. Envoy also provides an <a href="https://github.com/envoyproxy/envoy/blob/5b270c2f2a14ea4eac609bf855edcb8c051c2a39/configs/terminate_http_in_http2_connect.yaml">example configuration</a>, illustrated below.</p>

<p>The first HCM needs <code class="language-plaintext highlighter-rouge">upgrade_type: CONNECT</code> to enable CONNECT tunneling, and <code class="language-plaintext highlighter-rouge">http2_protocol_options</code> to use HTTP/2. The internal listener takes the TCP stream from the tunnel, parses the HTTP GET request, and directly returns an HTTP 200 response. Here the internal listener acts as the server side of the tunnel.</p>

<p><img src="https://raw.githubusercontent.com/shawnh2/shawnh2.github.io/master/_posts/img/2023-05-25/envoy-il-decap.png" alt="envoy-il-decap" /></p>
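<p>A rough sketch of the first HCM and the cluster that hands the extracted TCP stream to the internal listener is shown below. It is an illustration under assumptions rather than the linked example verbatim: the listening port 10001 and all listener, cluster, route, and stat names are hypothetical, and the internal listener holding the second HCM is omitted.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static_resources:
  listeners:
  - name: tunnel_terminator               # hypothetical name
    address:
      socket_address: {address: 0.0.0.0, port_value: 10001}  # assumed port
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: terminate_connect
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              # Match CONNECT requests and terminate the tunnel here.
              - match:
                  connect_matcher: {}
                route:
                  cluster: decap_cluster
                  upgrade_configs:
                  - upgrade_type: CONNECT
                    connect_config: {}
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          # Accept CONNECT over HTTP/2.
          http2_protocol_options:
            allow_connect: true
          upgrade_configs:
          - upgrade_type: CONNECT
  clusters:
  # The extracted TCP stream is sent to the internal listener with the second HCM.
  - name: decap_cluster                   # hypothetical cluster name
    load_assignment:
      cluster_name: decap_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              envoy_internal_address:
                server_listener_name: decap_listener  # hypothetical internal listener name
</code></pre></div></div>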

<p>Combining the Encapsulate and Decapsulate deployments, with one Envoy at each end of the HTTP CONNECT tunnel, yields an end-to-end HTTP CONNECT tunnel.</p>
<h2 id="reference">Reference</h2>

<ol>
  <li><a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/other_features/internal_listener">https://www.envoyproxy.io/docs/envoy/latest/configuration/other_features/internal_listener</a></li>
  <li><a href="https://www.zhaohuabing.com/post/2022-09-11-ambient-deep-dive-1/">https://www.zhaohuabing.com/post/2022-09-11-ambient-deep-dive-1/</a></li>
</ol>]]></content><author><name>Your Name</name><email>shawnhxh@outlook.com</email></author><category term="post" /><category term="Network" /><category term="Envoy" /><summary type="html"><![CDATA[Envoy supports user-space sockets, and a listener that accepts user-space connections is called an internal listener. An internal listener typically accepts connections that originate inside Envoy, for example a connection request from an upstream cluster, over which a TCP stream is then established. To use an internal listener, its name must be set as the endpoint address of an upstream cluster.]]></summary></entry></feed>