<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>networkop on networkop</title>
    <link>https://networkop.co.uk/</link>
    <description>Recent content in networkop on networkop</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <copyright>&amp;copy; Michael Kashin 2021</copyright>
    <lastBuildDate>Sun, 15 Oct 2017 00:00:00 +0000</lastBuildDate>
    <atom:link href="/" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Linux Networking - Source IP address selection</title>
      <link>https://networkop.co.uk/post/2023-09-linux-src/</link>
      <pubDate>Sat, 02 Sep 2023 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2023-09-linux-src/</guid>
      <description>

&lt;p&gt;Any network device, be it a transit router or a host, usually has multiple IP addresses assigned to its interfaces. One of the first things we learn as network engineers is how to determine which IP address is used for locally-sourced traffic. However, the default behaviour can be changed in a couple of different ways, and this post briefly documents the available options.&lt;/p&gt;

&lt;h2 id=&#34;the-default-scenario&#34;&gt;The Default Scenario&lt;/h2&gt;

&lt;p&gt;Whenever a local application decides to connect to a remote network endpoint, it creates a network socket, providing a minimal amount of details required to build and send a network packet. Most often, this information includes a destination IP and port number as you can see from the following abbreviated output:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ strace -e trace=network curl http://example.com
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 6
setsockopt(6, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(6, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(6, SOL_TCP, TCP_KEEPIDLE, [60], 4) = 0
setsockopt(6, SOL_TCP, TCP_KEEPINTVL, [60], 4) = 0
connect(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr(&amp;quot;93.184.216.34&amp;quot;)}, 16)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;While this output does not show the DNS resolution part (due to &lt;a href=&#34;https://man7.org/linux/man-pages/man3/getaddrinfo.3.html&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;getaddrinfo()&lt;/code&gt;&lt;/a&gt; not being a syscall), we can see that the only user-specific pieces of information provided by the application (&lt;code&gt;curl&lt;/code&gt;) in the &lt;a href=&#34;https://beej.us/guide/bgnet/html/#connect&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;connect()&lt;/code&gt;&lt;/a&gt; syscall are the remote socket port &lt;code&gt;sin_port&lt;/code&gt; and IP address &lt;code&gt;sin_addr&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What happens next is what we all learned to expect from any operating system, not just Linux:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Destination IP is looked up in the local routing table.&lt;/li&gt;
&lt;li&gt;The resulting route is used to determine the egress interface.&lt;/li&gt;
&lt;li&gt;The IP of that interface is assigned as the source address for the TCP socket.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is a sane default that picks an IP address that is most likely to reach the destination, since it&amp;rsquo;s assigned to an egress interface.&lt;/p&gt;

&lt;h2 id=&#34;user-provided-ip&#34;&gt;User-provided IP&lt;/h2&gt;

&lt;p&gt;In some scenarios, when multiple local IPs are reachable outside of the host, users may want to override the default behaviour. A very common use case is to specify an IP address (or interface name) as the traffic source. The following &lt;code&gt;strace&lt;/code&gt; output looks exactly the same as above, with one notable exception:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ strace -e trace=network curl --interface lo http://example.com
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(5, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(5, SOL_TCP, TCP_KEEPIDLE, [60], 4) = 0
setsockopt(5, SOL_TCP, TCP_KEEPINTVL, [60], 4) = 0
setsockopt(5, SOL_SOCKET, SO_BINDTODEVICE, &amp;quot;lo\0&amp;quot;, 3) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr(&amp;quot;93.184.216.34&amp;quot;)}, 16)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;a href=&#34;https://linux.die.net/man/2/setsockopt&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;setsockopt()&lt;/code&gt;&lt;/a&gt; syscall allows clients to bind to a specific interface name using the &lt;code&gt;SO_BINDTODEVICE&lt;/code&gt; option.&lt;/p&gt;

&lt;p&gt;Another alternative is to &lt;a href=&#34;https://beej.us/guide/bgnet/html/#bind&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;bind()&lt;/code&gt;&lt;/a&gt; the client socket to a specific IP address (&lt;code&gt;192.0.2.2&lt;/code&gt; is one of the IPs on the &lt;code&gt;lo&lt;/code&gt; interface), which is what &lt;code&gt;curl&lt;/code&gt; does in the following case:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ strace -e trace=network curl --interface 192.0.2.2 http://example.com
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(5, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(5, SOL_TCP, TCP_KEEPIDLE, [60], 4) = 0
setsockopt(5, SOL_TCP, TCP_KEEPINTVL, [60], 4) = 0
setsockopt(5, SOL_SOCKET, SO_BINDTODEVICE, &amp;quot;192.0.2.2\0&amp;quot;, 10) = -1 ENODEV (No such device)
bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr(&amp;quot;192.0.2.2&amp;quot;)}, 16) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr(&amp;quot;93.184.216.34&amp;quot;)}, 16)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The problem with the above options is that they are application-specific and, thus, require explicit user configuration. While this may work for a small number of applications, in some scenarios it may be easier to have a global setting that would influence this behaviour.&lt;/p&gt;

&lt;h2 id=&#34;netlink-route-source-ip&#34;&gt;Netlink Route Source IP&lt;/h2&gt;

&lt;p&gt;Another available option, frequently used on multi-homed L3 network hosts, is rtnetlink&amp;rsquo;s &lt;code&gt;src&lt;/code&gt; option, or &lt;a href=&#34;https://man7.org/linux/man-pages/man7/rtnetlink.7.html&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;RTA_PREFSRC&lt;/code&gt;&lt;/a&gt;. Continuing from the previous example, let&amp;rsquo;s add a static route for &lt;code&gt;example.com&lt;/code&gt; and specify the &lt;code&gt;src&lt;/code&gt; option with the loopback IP:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ip route add 93.184.216.34 via 172.20.20.1 src 192.0.2.2
$ ip route get 93.184.216.34
93.184.216.34 via 172.20.20.1 dev eth0 src 192.0.2.2 uid 0
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we can re-run the original &lt;code&gt;curl&lt;/code&gt; command without specifying the source IP:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ tcpdump -enni eth0 host 93.184.216.34 &amp;amp;
$ strace -e trace=network curl http://example.com
...
connect(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr(&amp;quot;93.184.216.34&amp;quot;)}, 16)
14:19:00.970631 IP 192.0.2.2.33068 &amp;gt; 93.184.216.34.80: Flags [S]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The resulting packet source IP has been changed by the kernel to the IP specified in the &lt;code&gt;ip route add&lt;/code&gt; command above. This option can also be configured by an IP routing daemon, for example, FRR&amp;rsquo;s route-map &lt;a href=&#34;https://docs.frrouting.org/en/stable-9.0/zebra.html#clicmd-set-src-ADDRESS&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;set src&lt;/code&gt;&lt;/a&gt; command or Bird&amp;rsquo;s &lt;a href=&#34;https://bird.network.cz/?get_doc&amp;amp;v=20&amp;amp;f=bird-6.html&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;krt_prefsrc&lt;/code&gt;&lt;/a&gt; configuration option.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Network Automation with CUE - Working with YANG-based APIs</title>
      <link>https://networkop.co.uk/post/2022-12-cue-yang/</link>
      <pubDate>Wed, 07 Dec 2022 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2022-12-cue-yang/</guid>
      <description>

&lt;p&gt;In the &lt;a href=&#34;https://networkop.co.uk/post/2022-11-cue-networking/&#34;&gt;previous post&lt;/a&gt;, I mentioned that CUE can help you work with both &amp;ldquo;industry-standard&amp;rdquo; semi-structured APIs and fully structured APIs where data is modelled using OpenAPI or JSON schema. However, there was an elephant in the room that I conveniently ignored but without which no conversation about network automation would be complete. With this post, I plan to rectify my previous omission and explain how you can use CUE to work with YANG-based APIs. More specifically, I&amp;rsquo;ll focus on OpenConfig and gNMI and show how CUE can be used to write YANG-based configuration data, validate it and send it to a remote device.&lt;/p&gt;

&lt;h2 id=&#34;automating-yang-based-apis-with-cue&#34;&gt;Automating YANG-based APIs with CUE&lt;/h2&gt;

&lt;p&gt;Working with YANG-based APIs is not much different from what I&amp;rsquo;ve described in the two previous blog posts &lt;a href=&#34;https://networkop.co.uk/post/2022-11-cue-ansible/&#34;&gt;[1]&lt;/a&gt; and &lt;a href=&#34;https://networkop.co.uk/post/2022-11-cue-networking/&#34;&gt;[2]&lt;/a&gt;. We&amp;rsquo;re still dealing with structured data that gets assembled based on the rules defined in a set of YANG models and sent over the wire using one of the supported protocols (NETCONF, RESTCONF or gNMI). One of the biggest differences, though, is that data generation is done in a general-purpose programming language (e.g. Python or Go), since doing it in Ansible is not feasible due to the sheer complexity of YANG schemas. What CUE brings to the table are the data transformation and generation capabilities often found in general-purpose programming languages, while still retaining the simplicity and readability of a DSL.&lt;/p&gt;

&lt;p&gt;If we want to use CUE, the main problem we have to solve is figuring out how to generate the YANG-based CUE definitions. Since YANG is not widely used outside of the physical networking infrastructure space, CUE does not have a native language adaptor for it. However, CUE has integrations with a &lt;a href=&#34;https://cuelang.org/docs/integrations/&#34; target=&#34;_blank&#34;&gt;number of&lt;/a&gt; structured data standards, which allows us to use one of them as an intermediate step.&lt;/p&gt;

&lt;p&gt;One of the projects that can generate Go language bindings from a set of YANG models is &lt;a href=&#34;https://github.com/openconfig/ygot&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;openconfig/ygot&lt;/code&gt;&lt;/a&gt;. Fortunately, CUE understands Go and can generate its own definitions from Go types using the &lt;code&gt;cue get go [packages]&lt;/code&gt; command. This makes the remainder of the network automation workflow very similar to what I&amp;rsquo;ve described in the &lt;a href=&#34;https://networkop.co.uk/post/2022-11-cue-networking/&#34;&gt;previous post&lt;/a&gt;. We combine CUE definitions with user-provided data, validating its structure and values. Using CUE scripting, we can serialise this data into JSON and orchestrate &lt;a href=&#34;https://gnmic.kmrd.dev/&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;gnmic&lt;/code&gt;&lt;/a&gt; to perform a &lt;a href=&#34;https://github.com/openconfig/reference/blob/master/rpc/gnmi/gnmi-specification.md#341-the-setrequest-message&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;Set&lt;/code&gt; RPC&lt;/a&gt; with the provided data in the payload.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/cue-yang.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Obviously, if things were that easy, I wouldn&amp;rsquo;t be writing this blog post now. YANG is a complicated language that was designed before our industry converged on a much (relatively) simpler set of schema standards. In the rest of this article, I will document what issues I hit when using the automatically-generated CUE definitions, how I worked around them and what challenges still lie ahead.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All code from this blog post can be found in the &lt;a href=&#34;https://github.com/networkop/yang-to-cue&#34; target=&#34;_blank&#34;&gt;yang-to-cue&lt;/a&gt; github repository&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&#34;generating-cue-definitions&#34;&gt;Generating CUE definitions&lt;/h2&gt;

&lt;p&gt;One thing I want to make clear from the outset is that if you want to use YANG-based APIs, you will most likely need to generate your language bindings or, in my case, CUE definitions automatically. There is absolutely no way you can (or should try to) create them manually. You can look at an &lt;a href=&#34;https://github.com/openconfig/public/blob/master/release/models/interfaces/openconfig-interfaces.yang&#34; target=&#34;_blank&#34;&gt;average YANG model&lt;/a&gt; or the &lt;a href=&#34;https://github.com/PacktPublishing/Network-Automation-with-Go/blob/main/ch08/json-rpc/pkg/srl/srl.go&#34; target=&#34;_blank&#34;&gt;size of a generated library&lt;/a&gt; to understand the level of complexity you are dealing with.&lt;/p&gt;

&lt;p&gt;With that in mind, the only way I could make it work was to use the &lt;code&gt;cue get go&lt;/code&gt; command, which means the first thing I had to do was generate Go types using &lt;a href=&#34;https://github.com/openconfig/ygot&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;openconfig/ygot&lt;/code&gt;&lt;/a&gt;. I won&amp;rsquo;t focus on how to do that here; you can see an example in steps 1-3 of the workflow described in the &lt;a href=&#34;https://github.com/networkop/yang-to-cue&#34; target=&#34;_blank&#34;&gt;yang-to-cue&lt;/a&gt; repo or read about it in the &lt;a href=&#34;https://www.packtpub.com/product/network-automation-with-go/9781800560925&#34; target=&#34;_blank&#34;&gt;Network Automation with Go book&lt;/a&gt;. Once you have those types defined, you can run the &lt;code&gt;cue get go&lt;/code&gt; command and pull them into your CUE code, for example:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;cue get go yang.to.cue/pkg/...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The above command would generate a &lt;code&gt;[package]_go_gen.cue&lt;/code&gt; file per Go package containing everything that has been recognised and imported. This is where I started seeing issues and below I&amp;rsquo;ll explain what they are and how I fixed them.&lt;/p&gt;

&lt;h3 id=&#34;challenge-1-optional-fields&#34;&gt;Challenge 1 - Optional fields&lt;/h3&gt;

&lt;p&gt;When it comes to field optionality, CUE and YANG have opposite defaults: in YANG, each node of a tree is optional by default, while in CUE all fields are mandatory unless they are explicitly marked as optional. When CUE imports definitions from Go types, it looks at each struct field and marks it as optional only if it is a pointer type. This leaves all non-pointer fields marked as required, which goes against the YANG defaults.&lt;/p&gt;

&lt;p&gt;The simplest solution is to walk through all of the fields defined in all of the structs and make them optional. CUE&amp;rsquo;s Go API includes a convenient helper function that traverses all nodes in a parsed CUE file and allows you to modify their content. Below is a snippet from the &lt;a href=&#34;https://github.com/networkop/yang-to-cue/blob/00f5287a29cf98f1746806e89c5a93b6d2d2d61d/post-import.go&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;post-import.go&lt;/code&gt;&lt;/a&gt; script that does that:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;case *ast.StructLit:
  for _, elt := range x.Elts {
    if field, ok := elt.(*ast.Field); ok {
      name, _, err := ast.LabelName(field.Label)
      if err != nil {
        log.Fatal(err)
      }
      if field.Optional == token.NoPos {
        log.Debugf(&amp;quot;found mandatory field: %s&amp;quot;, name)
        field.Optional = token.Blank.Pos()
      }
    }
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This was the simplest way to work around the problem. The downside is that we lose the ability to check whether a field was marked mandatory by a YANG model. To address this, we first need to wait for &lt;code&gt;ygot&lt;/code&gt; to implement &lt;a href=&#34;https://github.com/openconfig/ygot/issues/514&#34; target=&#34;_blank&#34;&gt;this functionality&lt;/a&gt;, by which time CUE&amp;rsquo;s &lt;a href=&#34;https://github.com/cue-lang/proposal/blob/main/designs/1951-required-fields-v2.md&#34; target=&#34;_blank&#34;&gt;mandatory field proposal&lt;/a&gt; may be implemented as well, making the future solution a bit easier.&lt;/p&gt;

&lt;h3 id=&#34;challenge-2-enums&#34;&gt;Challenge 2 - ENUMs&lt;/h3&gt;

&lt;p&gt;The second problem is caused by the way &lt;a href=&#34;https://github.com/openconfig/ygot&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;openconfig/ygot&lt;/code&gt;&lt;/a&gt; deals with YANG enum types. Most enum types I&amp;rsquo;ve seen are aliases of &lt;code&gt;int64&lt;/code&gt;, and each enum value is a constant (of the enum type) that stores that &lt;a href=&#34;https://www.rfc-editor.org/rfc/rfc7950#section-9.6.4.2&#34; target=&#34;_blank&#34;&gt;enum&amp;rsquo;s value&lt;/a&gt;. When emitting the JSON value, &lt;code&gt;ygot&lt;/code&gt; uses the constant to perform a lookup in the &lt;code&gt;ΛEnum&lt;/code&gt; dictionary, which stores the actual enum name. The following excerpt from the &lt;a href=&#34;https://github.com/networkop/yang-to-cue/blob/00f5287a29cf98f1746806e89c5a93b6d2d2d61d/pkg/yang.go&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;yang-to-cue/pkg/yang.go&lt;/code&gt;&lt;/a&gt; file should make it a bit clearer:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;type E_AristaIntfAugments_AristaAddrType int64

const (
  AristaIntfAugments_AristaAddrType_UNSET E_AristaIntfAugments_AristaAddrType = 0
  ...
  AristaIntfAugments_AristaAddrType_IPV6 E_AristaIntfAugments_AristaAddrType = 3
)
var ΛEnum = map[string]map[int64]ygot.EnumDefinition{
  &amp;quot;E_AristaIntfAugments_AristaAddrType&amp;quot;: {
    1: {Name: &amp;quot;PRIMARY&amp;quot;},
    2: {Name: &amp;quot;SECONDARY&amp;quot;},
    3: {Name: &amp;quot;IPV6&amp;quot;},
  },
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;By default, CUE would ingest all enum types and store them as integers and wouldn&amp;rsquo;t know anything about the above map or its string values. So what I had to do was parse the auto-generated CUE file and patch the enum definitions by replacing integers (enum&amp;rsquo;s value) with strings (enum&amp;rsquo;s name) from the &lt;code&gt;ΛEnum&lt;/code&gt; map. All this is done inside the same &lt;a href=&#34;https://github.com/networkop/yang-to-cue/blob/master/post-import.go#L208-L264&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;post-import.go&lt;/code&gt;&lt;/a&gt; script and the resulting CUE code looks something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-javascript&#34;&gt;#enumE_AristaIntfAugments_AristaAddrType:
  #AristaIntfAugments_AristaAddrType_UNSET |
  #AristaIntfAugments_AristaAddrType_PRIMARY |
  #AristaIntfAugments_AristaAddrType_SECONDARY |
  #AristaIntfAugments_AristaAddrType_IPV6

#E_AristaIntfAugments_AristaAddrType: string

#AristaIntfAugments_AristaAddrType_UNSET: 
    { #E_AristaIntfAugments_AristaAddrType &amp;amp; &amp;quot;UNSET&amp;quot; }
#AristaIntfAugments_AristaAddrType_PRIMARY: 
    { #E_AristaIntfAugments_AristaAddrType &amp;amp; &amp;quot;PRIMARY&amp;quot; }
...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This definition would allow you to write values using concrete value strings, e.g. &lt;code&gt;&amp;quot;addr-type&amp;quot;: &amp;quot;PRIMARY&amp;quot;&lt;/code&gt; or simply refer to one of the globally defined constants, as in the following example from the &lt;a href=&#34;https://github.com/networkop/yang-to-cue/blob/master/values.cue&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;yang-to-cue/values.cue&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-javascript&#34;&gt;config: {
  &amp;quot;addr-type&amp;quot;: oc.#AristaIntfAugments_AristaAddrType_PRIMARY
  &amp;quot;prefix-length&amp;quot;: 24
  ip: &amp;quot;192.0.2.1&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;challenge-3-yang-lists&#34;&gt;Challenge 3 - YANG lists&lt;/h3&gt;

&lt;p&gt;This ended up being the biggest challenge I had to solve. For all intents and purposes, a YANG list is a map (or a dictionary) with values identified by unique keys. So &lt;a href=&#34;https://github.com/openconfig/ygot&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;openconfig/ygot&lt;/code&gt;&lt;/a&gt; naturally stores YANG lists as Go maps. This makes it easier to ensure uniqueness and catch any duplicates. However, on the wire, a YANG list is represented as a list of objects (&lt;code&gt;[...{}]&lt;/code&gt;), so when it&amp;rsquo;s time to emit a payload, &lt;code&gt;ygot&lt;/code&gt; &lt;a href=&#34;https://github.com/openconfig/ygot/blob/master/ygot/render.go#L1281&#34; target=&#34;_blank&#34;&gt;translates&lt;/a&gt; maps to lists, producing a valid RFC7951 JSON.&lt;/p&gt;

&lt;p&gt;This last bit is unique to &lt;code&gt;ygot&lt;/code&gt;&amp;rsquo;s serialization logic and by default remains unknown to CUE. So I&amp;rsquo;ve taken the most straightforward approach and converted all maps to lists before running the &lt;code&gt;cue get go&lt;/code&gt; command. This is described in the readme of the &lt;a href=&#34;https://github.com/networkop/yang-to-cue&#34; target=&#34;_blank&#34;&gt;yang-to-cue&lt;/a&gt; repository and can be accomplished with a little bit of &lt;code&gt;sed&lt;/code&gt; magic:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sed -i -E &#39;s/map\[.*\]\*(\S+)/\[\]\*\1/&#39; pkg/yang.go
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;While this helps CUE generate valid RFC7951 JSON, it does not guarantee YANG list entry uniqueness, leaving room for error. Fortunately, it&amp;rsquo;s possible to use CUE itself to introduce additional constraints and ensure all entries in a list are unique.&lt;/p&gt;

&lt;p&gt;In the following example, I&amp;rsquo;m using a hidden field &lt;code&gt;_check&lt;/code&gt; to store a set of YANG keys and compare its length to the length of the corresponding YANG list. As long as the list and a set of its keys have the same size, the validation passes and a payload is emitted by CUE.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-javascript&#34;&gt;#OpenconfigInterfaces_Interfaces: {
  interface: [...null | #OpenconfigInterfaces_Interfaces_Interface]
  _check: {
    for intf in interface {
      let key = intf.name
      &amp;quot;\(key)&amp;quot;: true
    }
  }
  if len(_check) != len(interface) {_|_}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The above code snippet is automatically injected into every YANG list definition in CUE when the &lt;a href=&#34;https://github.com/networkop/yang-to-cue/blob/00f5287a29cf98f1746806e89c5a93b6d2d2d61d/post-import.go&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;post-import.go&lt;/code&gt;&lt;/a&gt; script is run with the default &lt;code&gt;-yanglist=true&lt;/code&gt; argument. The actual &lt;a href=&#34;https://github.com/networkop/yang-to-cue/blob/master/post-import.go#L189-L200&#34; target=&#34;_blank&#34;&gt;injected code&lt;/a&gt; is slightly more complicated: it accounts for composite keys (keys with more than one value) and includes a check that &lt;code&gt;entry.key&lt;/code&gt; is always the same as &lt;code&gt;entry.config.key&lt;/code&gt;, as &lt;a href=&#34;https://www.openconfig.net/docs/guides/style_guide/#list&#34; target=&#34;_blank&#34;&gt;required&lt;/a&gt; by the OpenConfig style guide.&lt;/p&gt;

&lt;h2 id=&#34;outro&#34;&gt;Outro&lt;/h2&gt;

&lt;p&gt;So where does all of the above leave us in relation to CUE and YANG? So far, I have been able to generate some pretty sizeable instances of YANG data using CUE and apply validation rules imported from &lt;code&gt;ygot&lt;/code&gt; packages. This makes me pretty comfortable that I&amp;rsquo;ve reached the 80% feature coverage target I was aiming for a &lt;a href=&#34;https://twitter.com/networkop1/status/1550145828236443648&#34; target=&#34;_blank&#34;&gt;few months ago&lt;/a&gt;. Here&amp;rsquo;s an example from the &lt;a href=&#34;https://github.com/networkop/yang-to-cue&#34; target=&#34;_blank&#34;&gt;yang-to-cue&lt;/a&gt; repo that you can successfully apply to any reachable Arista EOS device using the &lt;code&gt;cue apply&lt;/code&gt; command.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-javascript&#34;&gt;package main

import oc &amp;quot;yang.to.cue/pkg:yang&amp;quot;

config: oc.#Device &amp;amp; {
  interfaces: interface: [{
    config: {
      description: &amp;quot;loopback interface&amp;quot;
      mtu:         1500
      name:        &amp;quot;Loopback0&amp;quot;
    }
    name: &amp;quot;Loopback0&amp;quot;
    subinterfaces: {
      subinterface: [{
        config: {
          description: &amp;quot;default subinterface&amp;quot;
          index:       0
        }
        index: 0
        ipv4: {
          addresses: {
            address: [{
              ip: &amp;quot;192.0.2.1&amp;quot;
              config: {
                &amp;quot;addr-type&amp;quot;:     oc.#AristaIntfAugments_AristaAddrType_PRIMARY
                &amp;quot;prefix-length&amp;quot;: 24
                ip:              &amp;quot;192.0.2.1&amp;quot;
              }
            }]
          }
        }
      }]
    }
  }]
  &amp;quot;network-instances&amp;quot;: &amp;quot;network-instance&amp;quot;: [{
    config: name: &amp;quot;default&amp;quot;
    name: &amp;quot;default&amp;quot;
    protocols: protocol: [{
      bgp: {
        global: config: as: 65000
        neighbors: neighbor: [{
          &amp;quot;afi-safis&amp;quot;: &amp;quot;afi-safi&amp;quot;: [{
            &amp;quot;afi-safi-name&amp;quot;: oc.#OpenconfigBgpTypes_AFI_SAFI_TYPE_IPV4_UNICAST
            config: &amp;quot;afi-safi-name&amp;quot;: &amp;quot;IPV4_UNICAST&amp;quot;
          }]
          config: {
            &amp;quot;neighbor-address&amp;quot;: &amp;quot;169.254.0.1&amp;quot;
            &amp;quot;peer-as&amp;quot;:          65001
          }
          &amp;quot;neighbor-address&amp;quot;: &amp;quot;169.254.0.1&amp;quot;
        }]
      }
      config: {
        identifier: &amp;quot;BGP&amp;quot;
        name:       &amp;quot;BGP&amp;quot;
      }
      identifier: &amp;quot;BGP&amp;quot;
      name:       &amp;quot;BGP&amp;quot;
    }]
  }]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can use the approach described in this blog post to write and validate YANG-compliant data entirely in CUE and, once CUE gets its own &lt;a href=&#34;https://github.com/cue-lang/cue/issues/142&#34; target=&#34;_blank&#34;&gt;language server&lt;/a&gt;, writing this data would become even easier with IDE hints, autocompletion and error highlighting. Combine this with data generation and scripting capabilities described in the &lt;a href=&#34;https://networkop.co.uk/post/2022-11-cue-networking/&#34;&gt;previous post&lt;/a&gt; and this gives you a versatile and robust toolset to work with YANG-based APIs, something that has been missing for a very long time.&lt;/p&gt;

&lt;p&gt;There are still a few areas for improvement where CUE does not yet do as good a job as it could. One of them is error reporting in the YANG list validation logic: there&amp;rsquo;s no way to emit a custom error message, although this may change once &lt;a href=&#34;https://github.com/cue-lang/cue/issues/943&#34; target=&#34;_blank&#34;&gt;this proposal&lt;/a&gt; gets implemented. Another area for improvement could be extracting more metadata from Go types, but this seems to be unique to YANG/ygot and is unlikely to be implemented in CUE natively. That being said, I hope that the approach I&amp;rsquo;ve shown here &amp;ndash; importing Go types into CUE and modifying them afterwards with a Go script &amp;ndash; will work for most of the potential future improvements.&lt;/p&gt;

&lt;p&gt;Since CUE is a pre-1.0 language, I would expect a few more things to change in the coming months. I doubt these changes would have any major negative impact on &lt;a href=&#34;https://networkop.co.uk/tags/cue/&#34; target=&#34;_blank&#34;&gt;what I&amp;rsquo;ve written about CUE&lt;/a&gt; so far. If anything, they would improve the language, like the &lt;a href=&#34;https://github.com/cue-lang/cue/issues/165&#34; target=&#34;_blank&#34;&gt;query proposal&lt;/a&gt; that would simplify CUE&amp;rsquo;s data generation capabilities or the &lt;a href=&#34;https://github.com/cue-lang/cue/issues/2007&#34; target=&#34;_blank&#34;&gt;function signatures proposal&lt;/a&gt; to allow external, user-provided code to be injected into the CUE evaluation process. So, in my view, now is the right time to start exploring CUE and injecting it into various parts of your network automation workflow. As you dig more into the details of the language, you&amp;rsquo;ll discover more interesting patterns and applications and, hopefully, CUE (Configure, Unify, Execute) becomes that common language for configuration and data, unifying different parts of IT infrastructure.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Network Automation with CUE - Advanced workflows</title>
      <link>https://networkop.co.uk/post/2022-11-cue-networking/</link>
      <pubDate>Tue, 22 Nov 2022 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2022-11-cue-networking/</guid>
      <description>

&lt;p&gt;What I&amp;rsquo;ve covered in the &lt;a href=&#34;https://networkop.co.uk/post/2022-11-cue-ansible/&#34;&gt;previous blog post&lt;/a&gt; about CUE and Ansible were isolated use cases, disconnected islands in the sea of network automation. The idea behind that was to simplify the introduction of CUE into existing network automation workflows. However, this does not mean CUE is limited to those use cases and, in fact, CUE is most powerful when it&amp;rsquo;s used end-to-end &amp;mdash; both to generate device configurations and to orchestrate interactions with external systems. In this post, I&amp;rsquo;m going to demonstrate how to use CUE for advanced network automation workflows involving fetching information from an external device inventory management system, using it to build complex hierarchical configuration values and, finally, generating and pushing intended configurations to remote network devices.&lt;/p&gt;

&lt;h2 id=&#34;cue-vs-cue-scripting&#34;&gt;CUE vs CUE scripting&lt;/h2&gt;

&lt;p&gt;CUE was designed to be a simple, scalable and robust configuration language. This is why it includes type checking, schema and constraints validation as first-class constructs. There are some &lt;a href=&#34;https://cuelang.org/docs/usecases/configuration/&#34; target=&#34;_blank&#34;&gt;design decisions&lt;/a&gt;, like the lack of inheritance or value overrides, that may take new users by surprise; however, over time it becomes clear that they make the language simpler and more readable. One of the most interesting features of CUE, though, is that all code is hermetic. What that means is that all configuration values must come from local CUE files and cannot be dynamically fetched or injected into the evaluation process, so that no matter how many times or in which environment you run your CUE code, it always produces the same result.&lt;/p&gt;

&lt;p&gt;However, as we all know, in real life configuration values may come from many different places. In the network automation context, we often use IP address and infrastructure management systems (IPAM/DCIM) to store device-specific data, often referring to these systems as a &amp;ldquo;source of truth&amp;rdquo;. I won&amp;rsquo;t focus on the fact that most often these systems are managed imperatively (point and click), making them a very poor choice for this task (how do you roll back?), but their dominance and popularity in our industry are undeniable. So how can we make CUE work in such environments?&lt;/p&gt;

&lt;p&gt;CUE has an optional scripting layer that is complementary to the core functionality of the configuration language. The CUE scripting (or &lt;a href=&#34;https://cuelang.org/docs/usecases/configuration/#tooling&#34; target=&#34;_blank&#34;&gt;tooling&lt;/a&gt;) layer works by evaluating files (identified by the &lt;code&gt;_tool.cue&lt;/code&gt; suffix) that contain a set of tasks and executing them concurrently. These files are still written in CUE and can access the values defined in the rest of the CUE module; however, CUE tasks &lt;em&gt;are&lt;/em&gt; allowed to make local and remote I/O calls and can be strung together to form some pretty complex workflows. As you may have guessed, this is what allows us to interact with external databases and remote network devices.&lt;/p&gt;
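
&lt;p&gt;As a minimal, hypothetical example of the scripting layer (the file and command names are invented), a &lt;code&gt;hello_tool.cue&lt;/code&gt; file could define a &lt;code&gt;hello&lt;/code&gt; command with a single task built from the standard &lt;code&gt;tool/cli&lt;/code&gt; package:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;package main

import &amp;quot;tool/cli&amp;quot;

// invoked as: cue hello ./...
command: hello: {
  print: cli.Print &amp;amp; {
    text: &amp;quot;hello from a CUE tool file&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;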

&lt;h2 id=&#34;advanced-network-automation-workflow&#34;&gt;Advanced Network Automation Workflow&lt;/h2&gt;

&lt;p&gt;Let&amp;rsquo;s revisit the advanced network automation workflow that was described in the &lt;a href=&#34;https://networkop.co.uk/post/2022-10-cue-intro/&#34;&gt;CUE introduction post&lt;/a&gt;. What makes it different from the intermediate workflow is that host variables are sourced from multiple different places. In most common workflows, these places can be described as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Local static variables, defined in host and group variables.&lt;/li&gt;
&lt;li&gt;Variables injected by the environment, which often include sensitive information like secrets and passwords.&lt;/li&gt;
&lt;li&gt;Externally-sourced data, fetched and evaluated during runtime.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once this data is collected and evaluated, the remainder of the process looks very similar to what I&amp;rsquo;ve described in the &lt;a href=&#34;https://networkop.co.uk/post/2022-11-cue-ansible/&#34;&gt;previous blog post&lt;/a&gt;, i.e. this data is modified and expanded to generate a complete per-device set of variables which are then used to produce the final device configuration. The top part of the following diagram is a visual representation of this workflow.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/cue-advanced.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;The bottom part shows how the same data sources are consumed in the equivalent CUE workflow. External data from IPAM/DCIM systems is ingested using the CUE scripting layer and saved next to the rest of the CUE values. CUE runtime now takes the latest snapshot of external data, combines it with other local CUE values and generates a set of per-device configurations. At this point, we can either apply them as-is or combine them with Jinja templates to generate a semi-structured text before sending it to the remote device.&lt;/p&gt;

&lt;p&gt;In the rest of this blog post, I will cover some of the highlights of the above CUE workflow, while configuring an unnumbered BGP session between Arista cEOS and NVIDIA Cumulus Linux connected back-to-back. The goal is to show an example of how the data flows from its source all the way to its ultimate destination and how CUE can be used at every step of the way.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All code from this blog post can be found in the &lt;a href=&#34;https://github.com/networkop/cue-networking-II&#34; target=&#34;_blank&#34;&gt;cue-networking-II&lt;/a&gt; github repository&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&#34;pulling-configuration-data-from-external-systems&#34;&gt;Pulling Configuration Data from External Systems&lt;/h2&gt;

&lt;p&gt;For an external IPAM/DCIM system I&amp;rsquo;ll be using the public demo instance of &lt;a href=&#34;https://github.com/nautobot/nautobot&#34; target=&#34;_blank&#34;&gt;Nautobot&lt;/a&gt; located at &lt;a href=&#34;https://demo.nautobot.com/&#34; target=&#34;_blank&#34;&gt;demo.nautobot.com&lt;/a&gt;. Since this is a demo instance, it gets rebuilt periodically, so I need to pre-populate it with the required device data. This is done based on the static &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/inventory/inventory.cue&#34; target=&#34;_blank&#34;&gt;inventory file&lt;/a&gt; and automated with the &lt;code&gt;cue apply ./...&lt;/code&gt; command. The action of populating IPAM/DCIM systems with data is normally a day 0 exercise and is rarely included in day 1+ network automation workflows, so I won&amp;rsquo;t focus on it here. However, if you&amp;rsquo;re interested in an advanced REST API workflow orchestrated by CUE, you can check out the &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/seed_tool.cue&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;seed_tool.cue&lt;/code&gt;&lt;/a&gt; file for more details.&lt;/p&gt;

&lt;p&gt;Once we have the right data in Nautobot, we can fetch it by orchestrating a number of REST API calls with CUE. However, since Nautobot supports GraphQL, I&amp;rsquo;ll cheat a little bit and get all the data in a single RPC. The &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/query.gql&#34; target=&#34;_blank&#34;&gt;query itself&lt;/a&gt; is less important, as it&amp;rsquo;s unique to my specific requirements, so I&amp;rsquo;ll focus only on the CUE code. In the &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/fetch_tool.cue&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;fetch_tool.cue&lt;/code&gt;&lt;/a&gt; file I define a sequence of tasks that will get executed concurrently for all devices from the &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/inventory/inventory.cue#L14&#34; target=&#34;_blank&#34;&gt;inventory&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Query the GraphQL API endpoint of Nautobot and unmarshal the response into a CUE struct.&lt;/li&gt;
&lt;li&gt;Import the received data as CUE and save it in a device-specific directory.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All of the above can be done with a single &lt;code&gt;cue fetch ./...&lt;/code&gt; command and the following snippet shows how the first task is written in CUE:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;import (
	&amp;quot;text/template&amp;quot;
	&amp;quot;tool/http&amp;quot;
	&amp;quot;encoding/json&amp;quot;
)

command: fetch: {
 for _, dev in inventory.#devices {
  (dev.name): {
   gqlRequest: http.Post &amp;amp; {
    url:     inventory.ipam.url + &amp;quot;/graphql/&amp;quot;
    request: inventory.ipam.headers &amp;amp; {
     body: json.Marshal({
      query: template.Execute(gqlQuery.contents, {name: dev.name})
     })
    }
   }

   response: json.Unmarshal(gqlRequest.response.body)

   // save data in a file (omitted for brevity)
  }
 }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The above code snippet demonstrates how to make a single HTTP API call and parse the received payload using &lt;code&gt;tool/http&lt;/code&gt; and &lt;code&gt;encoding/json&lt;/code&gt; packages from the CUE&amp;rsquo;s &lt;a href=&#34;https://pkg.go.dev/cuelang.org/go@v0.4.3/pkg&#34; target=&#34;_blank&#34;&gt;standard library&lt;/a&gt;. The CUE scripting layer is smart enough to understand dependencies between tasks, e.g. in this case &lt;code&gt;json.Unmarshal&lt;/code&gt; will only be called once the &lt;code&gt;gqlRequest&lt;/code&gt; has returned a response, while still trying to run tasks concurrently (all GraphQL calls will be made at roughly the same time). This makes it highly efficient at almost no cost to the end user.&lt;/p&gt;

&lt;h2 id=&#34;data-transformation&#34;&gt;Data Transformation&lt;/h2&gt;

&lt;p&gt;At this point, it would make sense to talk a little about how CUE evaluates files from a hierarchical directory structure. In Ansible, it&amp;rsquo;s common to use &amp;ldquo;group&amp;rdquo; variables to manage settings common amongst multiple hosts. In CUE, you can use subdirectories to group related hosts and manage their common configuration values. Although my two-node test topology is not the best example for this, I still tried to group data based on the &lt;code&gt;device role&lt;/code&gt; value extracted from Nautobot. This is what the &lt;code&gt;./config&lt;/code&gt; directory structure looks like. As you can see, host-specific CUE files are sitting in leaf/edge directories, while common data values and operations are defined in their parent directories:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/cue-dirs.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Whenever a CUE script needs to evaluate data from one of these subdirectories (for example &lt;code&gt;./...&lt;/code&gt; tells CUE to evaluate all files recursively starting from the current directory), the values in the leaf subdirectories get merged with everything from their parents. So, for example, the &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/config/lleaf/lon-sw-01/lon-sw-01.cue&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;lon-sw-01.cue&lt;/code&gt;&lt;/a&gt; values will get merged with &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/config/lleaf/groupvars.cue&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;./lleaf/groupvars.cue&lt;/code&gt;&lt;/a&gt; but not with &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/config/sspine/groupvars.cue&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;sspine/groupvars.cue&lt;/code&gt;&lt;/a&gt;, which will get merged with &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/config/sspine/lon-sw-02/lon-sw-02.cue&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;lon-sw-02.cue&lt;/code&gt;&lt;/a&gt;. This is just an example of how to optimise configuration values to remove boilerplate, you can check out my earlier &lt;a href=&#34;https://github.com/networkop/cue-networking&#34; target=&#34;_blank&#34;&gt;cue-networking&lt;/a&gt; repository for a more complete real-world example.&lt;/p&gt;
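
&lt;p&gt;To make the merging behaviour concrete, here is a hypothetical sketch (the values are invented for illustration) of a group file and a host file from the same package being unified:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;// ./lleaf/groupvars.cue -- applies to every host in this group
hostvars: [_]: bgp: asn: 65001

// ./lleaf/lon-sw-01/lon-sw-01.cue -- host-specific values
hostvars: &amp;quot;lon-sw-01&amp;quot;: loopback: &amp;quot;198.51.100.1/32&amp;quot;

// evaluating ./... yields the union:
// hostvars: &amp;quot;lon-sw-01&amp;quot;: {bgp: asn: 65001, loopback: &amp;quot;198.51.100.1/32&amp;quot;}
&lt;/code&gt;&lt;/pre&gt;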

&lt;p&gt;So now in the leaf CUE files we&amp;rsquo;ve got the data that was retrieved from Nautobot, saved in a &lt;code&gt;hostvars: [device name]: {}&lt;/code&gt; struct. That means in the topmost &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/config/hostvars.cue&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;hostvars.cue&lt;/code&gt;&lt;/a&gt; file I&amp;rsquo;ve got access to all of that data and can start adding a schema and even do some initial value computations. You can view the resulting host variables with the &lt;code&gt;cue try ./...&lt;/code&gt; command.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cue try ./...
-== hostvars[lon-sw-02] ==-
name: lon-sw-02
device_role:
  name: sspine
&amp;gt; snip &amp;lt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The majority of the work is done in the &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/config/transform.cue&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;transform.cue&lt;/code&gt;&lt;/a&gt; file, where &lt;code&gt;hostvars&lt;/code&gt; get transformed into a complete structured device configuration. As I&amp;rsquo;ve already covered data transformation in the &lt;a href=&#34;https://networkop.co.uk/post/2022-11-cue-ansible/&#34;&gt;previous blog post&lt;/a&gt;, I won&amp;rsquo;t focus too much on it here, and invite you to walk through &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/config/transform.cue&#34; target=&#34;_blank&#34;&gt;the code&lt;/a&gt; on your own. However, before moving on, I want to discuss the use of schemas in the data transformation logic, e.g. &lt;code&gt;nvidia.#set&lt;/code&gt; in the below code snippet from the &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/config/transform.cue&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;transform.cue&lt;/code&gt;&lt;/a&gt; file:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;nvidiaX: {
  _input: {}
  nvidia.#set &amp;amp; {
    interface: {
      for _, intf in _input.interfaces {
        if strings.HasPrefix(intf.name, &amp;quot;loopback&amp;quot;) {
          lo: {
            ip: address: (intf.ip_addresses[0].address): {}
            type: &amp;quot;loopback&amp;quot;
// omitted for brevity
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Although schemas are optional, they can give you additional assurance that what you&amp;rsquo;re doing is right and catch errors before you try to use the generated data. Moreover, once CUE gets its own &lt;a href=&#34;https://github.com/cue-lang/cue/issues/142&#34; target=&#34;_blank&#34;&gt;language server&lt;/a&gt;, writing the code would become a lot easier with IDE&amp;rsquo;s help. Similar to Go, you would get features like struct templates, autocompletion and error highlighting.&lt;/p&gt;
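
&lt;p&gt;For instance (a hypothetical schema, not the one from the repository), because CUE definitions are closed by default, a simple typo in a field name becomes an evaluation error instead of silently producing broken data:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;#Interface: {
  name: string
  type: &amp;quot;loopback&amp;quot; | &amp;quot;swp&amp;quot;
}

lo: #Interface &amp;amp; {
  name: &amp;quot;lo&amp;quot;
  typ:  &amp;quot;loopback&amp;quot; // error: field not allowed: typ
}
&lt;/code&gt;&lt;/pre&gt;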

&lt;p&gt;The biggest problem with using a schema is generating it in the first place. I&amp;rsquo;ve briefly touched upon this subject in the &lt;a href=&#34;https://networkop.co.uk/post/2022-11-cue-ansible/#input-data-validation&#34; target=&#34;_blank&#34;&gt;previous blog post&lt;/a&gt; but want to expand a bit on it here. It doesn&amp;rsquo;t matter whether you work with a &lt;a href=&#34;https://docs.nvidia.com/networking-ethernet-software/cumulus-linux-44/api/index.html&#34; target=&#34;_blank&#34;&gt;model-compliant API&lt;/a&gt; (OpenAPI or YANG) or with &lt;a href=&#34;https://github.com/aristanetworks/ansible-avd/tree/devel/ansible_collections/arista/avd/roles/eos_cli_config_gen/templates/eos&#34; target=&#34;_blank&#34;&gt;templates&lt;/a&gt; that generate a semi-structured set of CLI commands, you can always describe their input with a data model. CUE understands a few common schema languages and can import and generate its own definitions from them. So now all that we need to do is generate that data model somehow.&lt;/p&gt;

&lt;p&gt;In some cases, you may be in luck if your vendor already publishes these models, however, this time I&amp;rsquo;ll focus on how to generate them manually. The detailed step-by-step process is &lt;a href=&#34;https://github.com/networkop/cue-networking-II#creating-cue-schemas&#34; target=&#34;_blank&#34;&gt;documented&lt;/a&gt; in the GitHub repository, but here I want to summarise some of the key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your device manages its configuration as structured data (the case of NVIDIA Cumulus Linux), you can generate a JSON schema from an existing configuration instance. For example, I&amp;rsquo;ve worked out the exact set of values I need to configure first, saved it in a YAML file and ran it through YAML to JSON schema &lt;a href=&#34;https://jsonformatter.org/yaml-to-jsonschema&#34; target=&#34;_blank&#34;&gt;converter&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;If you have to use text templates to produce the device config (the case of Arista EOS), you can infer a JSON schema from a Jinja template (see &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/main/schemas/jinja-to-json-schema.py&#34; target=&#34;_blank&#34;&gt;this script&lt;/a&gt; for an example).&lt;/li&gt;
&lt;li&gt;CUE can correctly recognise the JSON schema format and import it as native definitions using the &lt;code&gt;cue import&lt;/code&gt; command.&lt;/li&gt;
&lt;li&gt;Following the initial (double) conversion, some of the type information may get lost or distorted, so most likely you would need to massage the automatically generated CUE schema before you can use it. This, however, only needs to be done once, since you can discard the intermediate schema files and carry on working exclusively with CUE definitions from now on.&lt;/li&gt;
&lt;/ul&gt;
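
&lt;p&gt;To sketch the last two points (the field names are invented for illustration): after &lt;code&gt;cue import&lt;/code&gt; produces a loosely-typed definition, a one-line unification is often enough to restore the lost type information:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;// rough shape produced by `cue import` from the JSON schema
#set: interface: [string]: type: string

// manual tightening, done once after the import
#set: interface: [string]: type: &amp;quot;loopback&amp;quot; | &amp;quot;swp&amp;quot; | &amp;quot;bond&amp;quot;
&lt;/code&gt;&lt;/pre&gt;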

&lt;p&gt;You can view the generated structured device configurations, produced by the data transformation logic, by running the &lt;code&gt;cue show ./...&lt;/code&gt; command.&lt;/p&gt;

&lt;h2 id=&#34;configuration-push&#34;&gt;Configuration Push&lt;/h2&gt;

&lt;p&gt;This is the final stage of the CUE workflow where, once again, I use CUE scripting to interact with Arista&amp;rsquo;s JSON RPC and NVIDIA&amp;rsquo;s REST APIs. All that is done as a part of a user-defined &lt;code&gt;cue push ./...&lt;/code&gt; command that executes multiple vendor-dependent workflows in per-device coroutines. You can find the complete implementation in the &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/main_tool.cue&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;main_tool.cue&lt;/code&gt;&lt;/a&gt; file, and here I&amp;rsquo;d like to zoom in on a few interesting concepts.&lt;/p&gt;

&lt;p&gt;The first one is authentication and secret management. As I&amp;rsquo;ve mentioned before, one of the common ways of injecting secrets is via environment variables, e.g. if you&amp;rsquo;re running a workflow inside a CI/CD system. While CUE cannot inject them natively, you can achieve the same result using the &lt;code&gt;@tag&lt;/code&gt; keyword. A common pattern is to define default values that can be overridden with a user-provided command line tag, like in the following snippet from the &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/inventory/inventory.cue&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;inventory.cue&lt;/code&gt;&lt;/a&gt; file:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;auth: {
  nvidia: {
    user:     *&amp;quot;cumulus&amp;quot; | string @tag(nvidia_user)
    password: *&amp;quot;cumulus&amp;quot; | string @tag(nvidia_pwd)
  }
  arista: {
    user:     *&amp;quot;admin&amp;quot; | string @tag(arista_user)
    password: *&amp;quot;admin&amp;quot; | string @tag(arista_pwd)
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When calling any CUE script, you can now pass an additional &lt;code&gt;-t tag_name=tag_value&lt;/code&gt; flag that will get injected into your code. For example, this is how I would change the default password for Arista:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;export ARISTA_PWD=foo
cue push -t arista_pwd=$ARISTA_PWD ./...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Another interesting concept is the &lt;a href=&#34;https://cuetorials.com/patterns/functions/&#34; target=&#34;_blank&#34;&gt;function pattern&lt;/a&gt;. It&amp;rsquo;s an ability to abstract a reusable piece of CUE code in a dedicated struct that can be evaluated when needed by any number of callers. I&amp;rsquo;ve used this pattern multiple times in most of the &lt;code&gt;_tool.cue&lt;/code&gt; files, but below I&amp;rsquo;ll cover its simplest form.&lt;/p&gt;

&lt;p&gt;Before we can send the generated configuration to Arista eAPI endpoint, we need to wrap it with a few special keywords &amp;ndash; &lt;code&gt;enable&lt;/code&gt;, &lt;code&gt;configure&lt;/code&gt; and &lt;code&gt;write&lt;/code&gt;. This is done in a special struct called &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/main_tool.cue#L60&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;eapi_wrapper&lt;/code&gt;&lt;/a&gt;. This struct defines an abstract schema for its input (a list of strings) but performs some concrete actions on it (wraps it in special keywords). In order to &amp;ldquo;call&amp;rdquo; this &amp;ldquo;function&amp;rdquo; we unify it with a struct that we know will define these inputs as concrete values. CUE runtime will delay the evaluation of this function struct until all of its inputs are known. In the following example, once CUE generates a list of CLI commands in the &lt;code&gt;split_commands&lt;/code&gt; list, it will evaluate the &amp;ldquo;function call&amp;rdquo; expression and the result will become available to subsequent tasks in &lt;code&gt;wrapped_commands.output&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;eapi_wrapper: {
  input: [...string]
  output: [&amp;quot;enable&amp;quot;, &amp;quot;configure&amp;quot;] + input + [&amp;quot;write&amp;quot;]
}

command: push: {
  for _, dev in inventory.#devices {
    (dev.name): {
      // ...
      wrapped_commands: eapi_wrapper &amp;amp; {input: split_commands}
      // ...
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The last concept I wanted to cover is the sequencing of tasks in CUE scripts. As I&amp;rsquo;ve mentioned before, CUE runtime is able to infer the implicit dependencies between tasks and evaluate them in the right order. This happens when an input of one task consumes an output from another task. This way you can just focus on writing code, while CUE will do its best to parallelise as many tasks as it can.&lt;/p&gt;

&lt;p&gt;However, some tasks don&amp;rsquo;t have implicit dependencies but still need to be run in sequence. A good example of this is the interaction with NVIDIA&amp;rsquo;s NVUE API. The procedure to apply the generated configuration consists of 3 stages &amp;ndash; (1) creating a new configuration revision, (2) patching this revision with the generated data and (3) applying it. While 1-2 and 1-3 have implicit dependencies (revision ID generated in 1), stages 2 and 3 don&amp;rsquo;t, but 3 must always happen after 2. The way we can make it happen is by adding &lt;code&gt;$after&lt;/code&gt; to the third task, referencing the name of the second. This little trick allows CUE to build the right graph of dependencies and apply the revision only after it has been patched.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;createRevision: http.Post &amp;amp; {
  url: &amp;quot;https://\(dev.name):8765/nvue_v1/revision&amp;quot;
  // ...
}

patchRevision: http.Do &amp;amp; {
  method: &amp;quot;PATCH&amp;quot;
  url:    &amp;quot;https://\(dev.name):8765/nvue_v1/?rev=\(escapedID)&amp;quot;
  // ...
}

applyRevision: http.Do &amp;amp; {
  $after: patchRevision
  method: &amp;quot;PATCH&amp;quot;
  url:    &amp;quot;https://\(dev.name):8765/nvue_v1/revision/\(escapedID)&amp;quot;
  // ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can see the complete example of the last two concepts in the &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/main_tool.cue&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;main_tool.cue&lt;/code&gt;&lt;/a&gt; file and a few more advanced workflows in &lt;a href=&#34;https://github.com/networkop/cue-networking-II/blob/64064138005dc55b9fb7a0e5c3b3f9a55eecfdd0/seed_tool.cue&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;seed_tool.cue&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&#34;outro&#34;&gt;Outro&lt;/h2&gt;

&lt;p&gt;You can test the complete CUE workflow in a virtual environment with the help of &lt;a href=&#34;https://containerlab.dev/quickstart/&#34; target=&#34;_blank&#34;&gt;containerlab&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build the lab with &lt;code&gt;cue lab-up ./...&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Pre-seed the demo Nautobot instance with &lt;code&gt;cue apply ./...&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Import the data from Nautobot with &lt;code&gt;cue fetch ./...&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Push the generated device configs with &lt;code&gt;cue push ./...&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can verify that everything works as intended by pinging the peer device&amp;rsquo;s loopback, e.g. &lt;code&gt;docker exec lon-sw-01 ping 198.51.100.2&lt;/code&gt;. More importantly, at this stage, we have managed to replace all functions of Ansible, while having improved the data integrity, added flexibility and made our network automation workflow more robust.&lt;/p&gt;

&lt;p&gt;Another interesting bonus of using CUE, when compared to Ansible, is the reduced resource utilisation. Due to a completely different architecture, CUE consumes a lot fewer resources and works much faster than Ansible, while doing essentially the same work. I&amp;rsquo;ve done some measurements of how CUE compares to Ansible when doing remote machine execution (running commands via SSH) and making remote API calls and in both cases CUE outperforms Ansible across major dimensions. In the most extreme case (CUE API vs Ansible API), CUE is more than 3 times faster and consumes less than 8% of the memory required by Ansible. You can find this and other results in the &lt;a href=&#34;https://github.com/networkop/cue-ansible&#34; target=&#34;_blank&#34;&gt;cue-ansible&lt;/a&gt; repository.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;I think at this point I&amp;rsquo;ve covered all that I wanted about CUE and how it can be used for common network automation workflows. My hope is that people can see that there is a better alternative to what we use today and keep an open mind when making their next decision.&lt;/p&gt;

&lt;p&gt;If you feel like this is something unfamiliar and strange, remember that Ansible and Python all used to feel like that at some point in the past. If you have the desire to do things better and learn new things, then CUE can offer a lot in both departments.&lt;/p&gt;

&lt;p&gt;P.S. I still have enough material for another blog post about CUE and YANG. I haven&amp;rsquo;t finished exploring this topic so it may be a very small article, depending on how it goes. Stay tuned for more.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Network Automation with CUE - Augmenting Ansible workflows</title>
      <link>https://networkop.co.uk/post/2022-11-cue-ansible/</link>
      <pubDate>Fri, 11 Nov 2022 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2022-11-cue-ansible/</guid>
      <description>

&lt;p&gt;Hardly any conversation about network automation that happens these days can avoid the topic of automation frameworks. Amongst the few that are still actively developed, Ansible is by far the most popular choice. The Ansible ecosystem has been growing rapidly over the last few years, with modules being contributed by both internal (Red Hat) and external (community) developers. Having the backing of one of the largest open-source-first companies has allowed Ansible to spread into all areas of infrastructure &amp;ndash; from server automation to cloud provisioning. By following the principle of eating your own dog food, Red Hat used Ansible in a lot of its own open-source projects, which made it even more popular with the masses. Another important factor in Ansible&amp;rsquo;s success is the ease of understanding. When it comes to network automation, Ansible&amp;rsquo;s stateless and agentless architecture very closely follows a standard network operation experience &amp;ndash; SSH in, enter commands line-by-line, catch any errors, save and disconnect. But like many complex software projects, Ansible is not without its own challenges, and in this post, I&amp;rsquo;ll take a look at what they are and how CUE can help overcome them.&lt;/p&gt;

&lt;h2 id=&#34;ansible-automation-workflow&#34;&gt;Ansible Automation Workflow&lt;/h2&gt;

&lt;p&gt;Let&amp;rsquo;s start with an overview of the intermediate Ansible automation workflow that was described in the &lt;a href=&#34;https://networkop.co.uk/post/2022-10-cue-intro/&#34;&gt;previous post&lt;/a&gt; and try to see what areas are more prone to human error or may require additional improvement. In order to do that, I&amp;rsquo;ll break it down into a sequence of steps describing how configuration data travels through this automation workflow, where it gets mutated and how it is used:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A user creates a playbook, a device inventory and a set of variables describing the desired state of the network.&lt;/li&gt;
&lt;li&gt;Ansible runtime parses all input data and calculates a per-host set of variables.&lt;/li&gt;
&lt;li&gt;This set of high-level variables gets transformed into a larger set of low-level variables.&lt;/li&gt;
&lt;li&gt;The entire set of variables is now passed to a config generation module which combines them with one or more Jinja templates.&lt;/li&gt;
&lt;li&gt;The resulting semi-structured text is applied to the running device configuration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/cue-ansible.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;One of the first places where we can make a mistake is the input data. Specifically, a set of input variables is essentially a free-form YAML data structure with values sourced from up to &lt;a href=&#34;https://docs.ansible.com/ansible/latest/user_guide/playbooks_variables.html#understanding-variable-precedence&#34; target=&#34;_blank&#34;&gt;22 different places&lt;/a&gt;. There&amp;rsquo;s no way to verify that the shape of the input data structure is correct and the only way to validate the type of values is by using filters.&lt;/p&gt;

&lt;p&gt;However, even with filters, you can never be sure the returned value has the right type, as filters are built to &amp;ldquo;fail safe&amp;rdquo;. For example, the &lt;code&gt;ansible.utils.ipaddr&lt;/code&gt; filter will return the input value (as a string) if it&amp;rsquo;s a valid IP address, but will return a boolean &lt;code&gt;False&lt;/code&gt; if it isn&amp;rsquo;t, conflating the returned value and an error in a single variable. There&amp;rsquo;s no way to abort Ansible execution or signal to the user that the input value was incorrect unless you use &lt;code&gt;assert&lt;/code&gt; statements, which become pretty ineffective even with relatively small volumes of data.&lt;/p&gt;

&lt;p&gt;The next place where things can go wrong is the data transformation stage. This can be anything from a simple &lt;code&gt;builtin.set_fact&lt;/code&gt; module with a bunch of filters to what I describe as &amp;ldquo;Jinja programming&amp;rdquo; &amp;ndash; manipulating data structures using Jinja&amp;rsquo;s expression statements (e.g. &lt;code&gt;set&lt;/code&gt; and &lt;code&gt;do&lt;/code&gt; tags) or even building a structured document (YAML, JSON) using string interpolation. In any case, the likelihood of making a mistake gets even higher since both the input data and the transformation logic itself are dynamically-typed and Jinja is notorious for becoming &lt;a href=&#34;https://news.ycombinator.com/item?id=14777697&#34; target=&#34;_blank&#34;&gt;incomprehensible very quickly&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now we&amp;rsquo;re at the config generation phase where, once again, the input variables are passed without validation which means you can easily get tripped by one of the &lt;a href=&#34;https://docs.saltproject.io/en/latest/topics/troubleshooting/yaml_idiosyncrasies.html&#34; target=&#34;_blank&#34;&gt;YAML idiosyncrasies&lt;/a&gt; and troubleshooting Jinja templating errors is particularly painful as errors are often reported with a vague &amp;ldquo;undefined variable&amp;rdquo; message.&lt;/p&gt;

&lt;p&gt;Finally, one of the unlikely places that can benefit from CUE is the API interactions with remote devices. CUE&amp;rsquo;s &lt;a href=&#34;https://cuelang.org/docs/usecases/scripting/&#34; target=&#34;_blank&#34;&gt;scripting capabilities&lt;/a&gt; can orchestrate interaction with multiple HTTP-based APIs and, if possible, would do this concurrently. This not only accelerates execution but also reduces resource utilisation thanks to the CUE&amp;rsquo;s (Go&amp;rsquo;s) lightweight concurrency model compared to Ansible&amp;rsquo;s more expensive &lt;code&gt;os.fork()&lt;/code&gt; approach.&lt;/p&gt;

&lt;p&gt;If you go back and look at the first two areas I&amp;rsquo;ve identified above, you can see that they can easily be done by an external tool and integrated into any existing Ansible workflow without making any serious changes to how the config is generated or delivered. These will be the two things I&amp;rsquo;m going to cover in this post.&lt;/p&gt;

&lt;p&gt;The final two areas are more disruptive but may allow you to replace Ansible completely for pretty much any non-SSH API automation, i.e. JSON-RPC or REST APIs. I&amp;rsquo;ll cover them in the following article.&lt;/p&gt;

&lt;h2 id=&#34;input-data-validation&#34;&gt;Input Data Validation&lt;/h2&gt;

&lt;p&gt;If you&amp;rsquo;re thinking about giving CUE a try and not sure where to start, input data validation could be your best option. Creating a schema for input data is a good exercise to test and explore the language while having no negative impact on your automation workflow. The benefits, however, are worth it, as the schema will improve your automation workflow by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validating the structural shape of input variables to catch any potential indentation errors&lt;/li&gt;
&lt;li&gt;Making sure all variables have the right type, catching any typos before you run the playbook&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This could also be a good place to introduce additional constraints for values, for example, to verify that a BGP ASN is within a valid range or that IP addresses are well-formed. In general, once you&amp;rsquo;ve started with a simple schema, you can keep mixing in more policies to tighten the range of allowed values and improve overall data integrity.&lt;/p&gt;
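&lt;p&gt;As a rough sketch (the field names here are illustrative and not part of the golden turtle data model), such value constraints could look like this in CUE:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;import &amp;quot;net&amp;quot;

#bgp: {
    // 2-byte or 4-byte BGP ASN range
    asn: int &amp;amp; &amp;gt;=1 &amp;amp; &amp;lt;=4294967295
    // a string that must also parse as a valid IPv4 address
    router_id: string &amp;amp; net.IPv4
}
&lt;/code&gt;&lt;/pre&gt;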

&lt;p&gt;Let&amp;rsquo;s see a concrete example of how to develop a CUE schema to validate input variables using Cumulus&amp;rsquo;s &lt;code&gt;golden turtle&lt;/code&gt; &lt;a href=&#34;https://gitlab.com/cumulus-consulting/goldenturtle/cumulus_ansible_modules.git&#34; target=&#34;_blank&#34;&gt;Ansible modules&lt;/a&gt;. Get yourself a copy of this repository:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;git clone https://gitlab.com/cumulus-consulting/goldenturtle/cumulus_ansible_modules.git &amp;amp;&amp;amp; cd cumulus_ansible_modules
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You&amp;rsquo;ll find several validated network topologies inside of the &lt;code&gt;inventories/&lt;/code&gt; directory together with a set of input variables spread across standard Ansible group and host variable directories. To make this example a bit simpler, I&amp;rsquo;ll focus on the bonds (link aggregation) configuration, and the following example shows a snippet of the &lt;code&gt;bonds&lt;/code&gt; variable from the &lt;a href=&#34;https://gitlab.com/cumulus-consulting/goldenturtle/cumulus_ansible_modules/-/blob/master/inventories/evpn_symmetric/group_vars/leaf/common.yml#L20&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;group_vars/leaf/common.yml&lt;/code&gt;&lt;/a&gt; file:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;bonds:
  - name: bond1
    ports: [swp1]
    clag_id: 1
    bridge:
      access: 10
    options:
      mtu: 9000
      extras:
        - bond-lacp-bypass-allow yes
        - mstpctl-bpduguard yes
        - mstpctl-portadminedge yes
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I&amp;rsquo;ve picked this example deliberately because it contains many places where we can make a mistake, but also because it can be very succinctly summarized by the following CUE schema:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;#bonds: [...{
    name: string
    ports: [...string] 
    clag_id: int
    bridge: access: int
    options: {
        mtu: int &amp;amp; &amp;lt;9999
        extras: [...string]
    }
}]

bonds: #bonds
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here we&amp;rsquo;ve created a &lt;a href=&#34;https://cuelang.org/docs/tutorials/tour/types/defs/&#34; target=&#34;_blank&#34;&gt;CUE definition&lt;/a&gt; that describes the structure and type of values expected in the &lt;code&gt;bonds&lt;/code&gt; variable. The last line &amp;ldquo;applies&amp;rdquo; the &lt;code&gt;#bonds&lt;/code&gt; schema to any existing &lt;code&gt;bonds&lt;/code&gt; variable. Assuming the above schema is saved in the &lt;code&gt;bonds.cue&lt;/code&gt; file, we can check if the input variables conform to it with the following command:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ cue vet bonds.cue inventories/evpn_symmetric/group_vars/leaf/common.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now let&amp;rsquo;s introduce a mistake by changing the value of MTU in the input variable. The resulting error message tells us exactly where the error is and why it&amp;rsquo;s not valid:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ sed -i &#39;s/mtu: 9000/mtu: 90000/&#39; inventories/evpn_symmetric/group_vars/leaf/common.yml
$ cue vet bonds.cue inventories/evpn_symmetric/group_vars/leaf/common.yml
bonds.0.options.mtu: invalid value 90000 (out of bound &amp;lt;9999):
    ./bonds.cue:8:20
    ./inventories/evpn_symmetric/group_vars/leaf/common.yml:27:13
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can experiment a bit more by changing the values in the input data, for example, try changing &lt;code&gt;ports&lt;/code&gt; to an empty list or left-shifting the indentation of the &lt;code&gt;access: 10&lt;/code&gt; line.&lt;/p&gt;
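&lt;p&gt;If you decide that an empty &lt;code&gt;ports&lt;/code&gt; list should also be a validation error, the schema can require at least one element. Here&amp;rsquo;s a sketch of how the &lt;code&gt;#bonds&lt;/code&gt; definition could be tightened:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;#bonds: [...{
    name: string
    // a bond must have at least one member port
    ports: [string, ...string]
    ...
}]
&lt;/code&gt;&lt;/pre&gt;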

&lt;p&gt;Creating schemas for every input variable can be a tedious process. However, there&amp;rsquo;s a shortcut that can get you a working schema relatively easily. It&amp;rsquo;s a two-step process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use one of the open-source code generators to produce (infer) a JSON Schema from a &lt;a href=&#34;https://www.npmjs.com/package/yaml-to-json-schema&#34; target=&#34;_blank&#34;&gt;YAML&lt;/a&gt;, &lt;a href=&#34;https://jsonschema.net/&#34; target=&#34;_blank&#34;&gt;JSON&lt;/a&gt; or a &lt;a href=&#34;https://jinja2schema.readthedocs.io/en/latest/&#34; target=&#34;_blank&#34;&gt;Jinja template&lt;/a&gt; document&lt;/li&gt;
&lt;li&gt;Convert JSON Schema to CUE using the &lt;code&gt;cue import&lt;/code&gt; command.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make it easier to follow, I&amp;rsquo;ve run the original &lt;code&gt;bonds&lt;/code&gt; variable through an &lt;a href=&#34;https://jsonformatter.org/yaml-to-jsonschema&#34; target=&#34;_blank&#34;&gt;online converter&lt;/a&gt;, saved the result in a &lt;code&gt;schema.json&lt;/code&gt; file, and imported it using the &lt;code&gt;cue import -f -p schema schema.json&lt;/code&gt; command. The resulting &lt;code&gt;schema.cue&lt;/code&gt; file contained the following:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;bonds: [...#Bond]

#Bond: {
        name: string
        ports: [...string]
        clag_id: int
        bridge:  #Bridge
        options: #Options
}

#Bridge: access: int

#Options: {
        mtu: int
        extras: [...string]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Although it&amp;rsquo;s a slightly different (more verbose) version of my hand-written CUE schema, most of the values are exactly the same. The only bits that are missing are constraints and policies, which are optional and can be added at a later stage. You can find another example of the above process on the &lt;a href=&#34;https://github.com/networkop/cue-ansible/tree/main/jinja&#34; target=&#34;_blank&#34;&gt;Jinja to CUE&lt;/a&gt; page of my &lt;a href=&#34;https://github.com/networkop/cue-ansible&#34; target=&#34;_blank&#34;&gt;cue-ansible repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once you have your schemas developed, you can start adding them to an existing Ansible workflow. Here are some ideas of how this can be done, starting from the easiest one:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You can add an extra task to the top of your Ansible playbook that uses the &lt;code&gt;shell&lt;/code&gt; module to execute &lt;code&gt;cue vet&lt;/code&gt; against input variables.&lt;/li&gt;
&lt;li&gt;If you have an existing CI system, you can add &lt;code&gt;cue vet&lt;/code&gt; as a new step before the &lt;code&gt;ansible-playbook&lt;/code&gt; command is executed.&lt;/li&gt;
&lt;li&gt;Another option is to create a custom module that can be configured to run CUE schema validation for any schema or input variables.&lt;/li&gt;
&lt;/ol&gt;
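<p_placeholder></p_placeholder>
&lt;p&gt;For the first option, a minimal sketch (assuming the schema file and inventory paths from the earlier example) could look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;- name: Validate input variables with CUE
  ansible.builtin.shell: cue vet bonds.cue inventories/evpn_symmetric/group_vars/leaf/common.yml
  delegate_to: localhost
  run_once: true
  changed_when: false
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A non-zero exit code from &lt;code&gt;cue vet&lt;/code&gt; will fail the task and stop the play before any invalid data reaches the devices.&lt;/p&gt;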

&lt;p&gt;The last option requires you to write an Ansible module in Go, but it allows you to have a native way of providing inputs and consuming outputs:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;- name: Validate input data model with CUE
  cue_validate:
    schema: &amp;quot;schemas/input.cue&amp;quot;
    input: &amp;quot;{{ hostvars[inventory_hostname] | string | b64encode }}&amp;quot;
  delegate_to: localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can find a &lt;a href=&#34;https://github.com/networkop/cue-ansible/blob/main/validation/src/main.go&#34; target=&#34;_blank&#34;&gt;reference implementation&lt;/a&gt; of this module with an example workflow in the &lt;a href=&#34;https://github.com/networkop/cue-ansible/tree/main/validation&#34; target=&#34;_blank&#34;&gt;Validation&lt;/a&gt; page of my &lt;a href=&#34;https://github.com/networkop/cue-ansible&#34; target=&#34;_blank&#34;&gt;cue-ansible repo&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&#34;data-transformation&#34;&gt;Data Transformation&lt;/h2&gt;

&lt;p&gt;At this point, we&amp;rsquo;ve only used CUE for schema validation. The next logical step is to ingest all input values in CUE and start working with them as native CUE values. There are many benefits to using CUE for value management, and I&amp;rsquo;ll cover some of them in the following blog posts, but for now, let me focus on a very common task of data transformation.&lt;/p&gt;

&lt;p&gt;For demonstration purposes, I&amp;rsquo;ll be using Arista&amp;rsquo;s Validated Design (&lt;a href=&#34;https://github.com/aristanetworks/ansible-avd&#34; target=&#34;_blank&#34;&gt;AVD&lt;/a&gt;) as it&amp;rsquo;s one of the most interesting examples of data transformation done in Ansible. AVD uses a combination of custom Python modules and Jinja templates to transform high-level input data and generate structured configs that have all the values required by devices. My goal would be to demonstrate CUE&amp;rsquo;s data transformation capabilities by removing parts of Ansible code and Jinja templates and replacing them with CUE code, while keeping both inputs and outputs unchanged.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/arista-avd.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s start by cloning the AVD repo and pinning the Ansible collection path to that directory.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ git clone https://github.com/aristanetworks/ansible-avd.git &amp;amp;&amp;amp; cd ansible-avd
$ export ANSIBLE_COLLECTIONS_PATH=$(pwd)
$ export OUT_DIR=intended/structured_configs
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Using one of the included example topologies, I run through the entire data transformation stage shown in the above diagram, first without using CUE.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ cd ansible_collections/arista/avd/examples/l2ls-fabric
$ ansible-playbook build.yml  --tags build,facts,debug
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the &lt;code&gt;./intended/structured_configs&lt;/code&gt; directory, I now have a set of structured device configs and input host variables. Next, I&amp;rsquo;m going to do two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Import all input host variables to allow me to use them natively as CUE values.&lt;/li&gt;
&lt;li&gt;Save the generated structured device configuration of &lt;code&gt;LEAF1&lt;/code&gt; switch as a baseline for future comparison (I&amp;rsquo;m running it through &lt;code&gt;cue eval --out=yaml&lt;/code&gt; simply to update the indentation).&lt;/li&gt;
&lt;/ol&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ cue import -p hostvars -f $OUT_DIR/LEAF1-debug-vars.yml
$ mv $OUT_DIR/LEAF1-debug-vars.cue leaf1.cue
$ cue eval $OUT_DIR/LEAF1.yml --out=yaml &amp;gt; $OUT_DIR/LEAF1.base.yml   
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In order to keep the input values separate from the data transformation logic, I&amp;rsquo;ve moved them into their own &lt;code&gt;hostvars&lt;/code&gt; package using the &lt;code&gt;-p&lt;/code&gt; flag in the command above. CUE&amp;rsquo;s code organisation practices are very similar to Go&amp;rsquo;s and allow me to group code into packages and group related packages into modules. To import the &lt;code&gt;hostvars&lt;/code&gt; package, I first need to initialise a CUE module:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;cue mod init arista.avd
&lt;/code&gt;&lt;/pre&gt;
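<p_placeholder></p_placeholder>
&lt;p&gt;Once the module is initialised and the transformation code from the next step is in place, the working directory will contain roughly the following layout (a sketch showing only the CUE-related files):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cue.mod/module.cue   // contains: module: &amp;quot;arista.avd&amp;quot;
leaf1.cue            // package hostvars: the imported input variables
transform.cue        // package avd: the data transformation logic
&lt;/code&gt;&lt;/pre&gt;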

&lt;p&gt;Now I can create a new file called &lt;code&gt;transform.cue&lt;/code&gt; and get access to all input variables using the &lt;code&gt;arista.avd:hostvars&lt;/code&gt; import statement. From here on, I can use a standard set of data manipulation techniques like the &lt;code&gt;for&lt;/code&gt; loop, string interpolation, variable declarations and conditionals to expand the high-level data model into a low-level structured configuration, focusing only on port channel interfaces for this example:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;package avd

import (
	&amp;quot;arista.avd:hostvars&amp;quot;
	&amp;quot;strconv&amp;quot;
)

// Uplink port channels
port_channel_interfaces: {
	for link in hostvars.switch.uplinks if link.channel_group_id != _|_ {
		let groupID = strconv.Atoi(link.channel_group_id)

		&amp;quot;Port-Channel\(groupID)&amp;quot;: {
			description: link.channel_description + &amp;quot;_Po\(groupID)&amp;quot;
			type:        &amp;quot;switched&amp;quot;
			shutdown:    false
			if link.vlans != _|_ {
				vlans: link.vlans
			}
			mode: &amp;quot;trunk&amp;quot;
			if hostvars.switch.mlag != _|_ {
				mlag: groupID
			}
		}
	}
}

// MLAG port channels
if hostvars.switch.mlag != _|_ {
    port_channel_interfaces: {
        let groupID = strconv.Atoi(hostvars.switch.mlag_port_channel_id)

        &amp;quot;Port-Channel\(groupID)&amp;quot;: {
            description: &amp;quot;MLAG_PEER_&amp;quot; + hostvars.switch.mlag_peer + &amp;quot;_Po\(groupID)&amp;quot;
            type: &amp;quot;switched&amp;quot;
            shutdown: false
            vlans: hostvars.switch.mlag_peer_link_allowed_vlans
            mode: &amp;quot;trunk&amp;quot;
            trunk_groups: [&amp;quot;MLAG&amp;quot;]
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;blockquote&gt;
&lt;p&gt;The &lt;code&gt;if value != _|_&lt;/code&gt; expression in the above example is a check if a value is defined, where &lt;code&gt;_|_&lt;/code&gt; is a special &lt;a href=&#34;https://cuelang.org/docs/tutorials/tour/types/bottom/&#34; target=&#34;_blank&#34;&gt;&amp;ldquo;bottom&amp;rdquo; or error value&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The example above contains enough data transformation logic to generate the required set of port-channel interfaces, and can be checked as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cue eval transform.cue
port_channel_interfaces: {
    &amp;quot;Port-Channel47&amp;quot;: {
        description: &amp;quot;MLAG_PEER_LEAF2_Po47&amp;quot;
        type:        &amp;quot;switched&amp;quot;
        shutdown:    false
        vlans:       &amp;quot;2-4094&amp;quot;
        mode:        &amp;quot;trunk&amp;quot;
        trunk_groups: [&amp;quot;MLAG&amp;quot;]
    }
    &amp;quot;Port-Channel1&amp;quot;: {
        description: &amp;quot;SPINES_Po1&amp;quot;
        type:        &amp;quot;switched&amp;quot;
        shutdown:    false
        vlans:       &amp;quot;10,20&amp;quot;
        mlag:        1
        mode:        &amp;quot;trunk&amp;quot;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now let&amp;rsquo;s remove the port channel data generation logic from AVD&amp;rsquo;s Python module and completely wipe out the corresponding Jinja template:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sed -i &#39;/port_channel_interface_name: port_channel_interface,/d&#39; ../../roles/eos_designs/python_modules/mlag/__init__.py
$ cat /dev/null &amp;gt; ../../roles/eos_designs/templates/underlay/interfaces/port-channel-interfaces.j2
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I re-run the playbook again to see what results I get after the above changes:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ ansible-playbook build.yml  --tags build,facts,debug
$ cue eval $OUT_DIR/LEAF1.yml --out=yaml &amp;gt; $OUT_DIR/LEAF1.new.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The resulting structured config should contain no port channel configuration data, which I verify by comparing with the baseline:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-diff&#34;&gt;$ diff $OUT_DIR/LEAF1.new.yml $OUT_DIR/LEAF1.base.yml
67c67,82
&amp;lt; port_channel_interfaces: {}
---
&amp;gt; port_channel_interfaces:
&amp;gt;   Port-Channel47:
&amp;gt;     description: MLAG_PEER_LEAF2_Po47
&amp;gt;     type: switched
&amp;gt;     shutdown: false
&amp;gt;     vlans: &amp;quot;2-4094&amp;quot;
&amp;gt;     mode: trunk
&amp;gt;     trunk_groups:
&amp;gt;       - MLAG
&amp;gt;   Port-Channel1:
&amp;gt;     description: SPINES_Po1
&amp;gt;     type: switched
&amp;gt;     shutdown: false
&amp;gt;     vlans: 10,20
&amp;gt;     mode: trunk
&amp;gt;     mlag: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;However, since I already have the correct port channel data produced by my CUE code, I can merge it with the latest structured config. Note that I pass both CUE and YAML files as inputs to the &lt;code&gt;cue eval&lt;/code&gt; command, leaving it up to CUE to recognise the types, import and evaluate everything as a single set of CUE values.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ cue eval transform.cue $OUT_DIR/LEAF1.yml --out=yaml &amp;gt; $OUT_DIR/LEAF1.new.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Re-running the earlier diff command should show that the new structured device config looks exactly the same as the baseline (with the minor exception of struct field re-ordering). This means I have generated exactly the same output from the same set of inputs, bypassing Python and Jinja and moving all port-channel data transformation logic into CUE. This way, I have consolidated and unified the data transformation and made it easier to read and reason about.&lt;/p&gt;

&lt;p&gt;Now that I&amp;rsquo;ve covered the first two stages of the advanced automation workflow, it&amp;rsquo;s time to move on to the final two stages and wrap up the Ansible portion of this blog post series. In the next post, I&amp;rsquo;ll show how to hierarchically organise CUE code to minimise boilerplate, how to work with externally-sourced data like IPAM or secret stores and use CUE&amp;rsquo;s scripting to apply configurations to multiple devices at the same time.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Network Automation with CUE - Introduction</title>
      <link>https://networkop.co.uk/post/2022-10-cue-intro/</link>
      <pubDate>Thu, 27 Oct 2022 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2022-10-cue-intro/</guid>
      <description>

&lt;p&gt;In the past few years, network automation has made its way from a new and fancy way of configuring devices to a well-recognized industry practice. What started as a series of &amp;ldquo;hello world&amp;rdquo; examples has evolved into an entire discipline with books, professional certifications and dedicated career paths. It&amp;rsquo;s safe to say that today, most large-scale networks (&amp;gt;100 devices) are at least deployed (day 0) and sometimes managed (day 1+) using an automated workflow. However, at the heart of these workflows are the same exact principles and tools that were used in the early days. Of course, these tools have evolved and matured but they still have the same scope and limitations. Very often, these limitations are only becoming obvious once we hit a certain scale or complexity, which makes it even more difficult to replace them. The easiest option is to accept and work around them, forcing the square peg down the round hole. In this post, I&amp;rsquo;d like to propose an alternative approach to what I&amp;rsquo;d consider &amp;ldquo;traditional&amp;rdquo; network automation practices by shifting the focus from &amp;ldquo;driving the CLI&amp;rdquo; to the management of data. I believe that this adjustment will enable us to build automation workflows that are much more robust and scalable and there are emerging tools and practices that were designed to address exactly that.&lt;/p&gt;

&lt;h2 id=&#34;evolution-of-network-configuration-management&#34;&gt;Evolution of Network Configuration Management&lt;/h2&gt;

&lt;p&gt;In order to understand why data management is important, we need to have a closer look at what constitutes a typical network automation workflow. The most basic process starts by combining a device data model, represented by a free-form data structure (e.g. YAML), with a text template (e.g. Jinja) to produce the desired device configuration. This configuration is then passed to a function that implements the underlying transport protocol (e.g. netmiko), which applies the desired changes. Some of these steps can be abstracted away by automation frameworks (e.g. Ansible) but largely the process still looks the same under the hood. This is what you can see on the left-hand side of the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/cue-evolution.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;While the basic workflow may work well for the initial network configuration, it is rarely suitable for ongoing operations due to its inherent verbosity. The natural reaction to that is to create another layer of abstraction that hides common design conventions, configuration defaults and computable attributes behind a terse high-level data model, as depicted by the &amp;ldquo;intermediate workflow&amp;rdquo; in the above diagram. This high-level data model simplifies the end-user experience of interacting with the automation workflow, but it comes at the expense of additional complexity hidden in the high-to-low-level translation logic.&lt;/p&gt;

&lt;p&gt;Finally, operators of some physical networks have decided to replicate the self-service cloud experience by allowing some parts of the network state to be managed dynamically. One simple example is allowing a compute team to manage the VLAN assignment on the downlink network ports. This means that a single, flat-text data structure is no longer enough to store the high-level configuration intent, so it gets split across multiple (preferably) non-overlapping sources of truth, as visualized by the &amp;ldquo;advanced workflow&amp;rdquo; in the above diagram.&lt;/p&gt;

&lt;p&gt;If you look at the above diagram, you might notice one theme that emerges and evolves together with the complexity of automation workflows. It is the ever-increasing focus on data. Thanks to the growing number of templates, we started caring less about individual vendor configuration dialects and more about how to source, structure and combine input configuration values. I would argue that these input values have become the new API, since the old APIs (&amp;ldquo;industry-standard&amp;rdquo; CLI) were not built for automation and eventually got abstracted away by libraries like scrapli or netmiko and a ton of (mostly) Jinja templates.&lt;/p&gt;

&lt;p&gt;The same argument can be applied to the YANG-based APIs, which &lt;em&gt;were&lt;/em&gt; designed for machine-to-machine communication and are slowly but steadily getting more traction. Those APIs are often abstracted away by software platforms, such as OpenDaylight or Tail-f NSO, or libraries like &lt;a href=&#34;https://github.com/openconfig/ygot&#34; target=&#34;_blank&#34;&gt;ygot&lt;/a&gt;, and operational tasks are, once again, reduced to the management of input data.&lt;/p&gt;

&lt;h2 id=&#34;automation-tools&#34;&gt;Automation Tools&lt;/h2&gt;

&lt;p&gt;I want to frame the discussion of automation tools in the context of the &lt;a href=&#34;http://mikehadlow.blogspot.com/2012/05/configuration-complexity-clock.html&#34; target=&#34;_blank&#34;&gt;configuration complexity clock&lt;/a&gt;. The main premise of this theory is that the process of finding the right level of abstraction for configuration values is cyclical. Here&amp;rsquo;s my free interpretation of the original story, translated into the network automation reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;00:00&lt;/strong&gt;: We need to configure a network but don&amp;rsquo;t have time for proper planning, so we have all configurable values hard-coded in flat-text configuration files and simply push them to network devices.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;03:00&lt;/strong&gt;: We realise that some parts of the network need to change, so we extract some of the hard-coded values, simplify them and make them configurable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;06:00&lt;/strong&gt;: The size of the configurable values continues to grow and we start building guardrails to prevent typical configuration mistakes and guarantee value uniqueness across the environment. We create a schema to validate input data and may even expose it via a GUI.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;09:00&lt;/strong&gt;: At some point the guardrails, schemas and policy engines start being a hurdle, and we decide to consolidate all of them in a single framework, driven by its own DSL. Quickly realising that the framework cannot meet all our requirements, we start extending it with custom code.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;12:00&lt;/strong&gt;: Now that we have all our policies embedded in custom framework extensions and values hard-coded in the DSL, the network management process looks not much different from where we started.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strictly speaking, this is more of a fable than a theory. It shows what a constant striving for improvement can do to an application&amp;rsquo;s user interface. If you read the original article, the author notes that very few organisations go all the way around the clock, which means the majority settle somewhere in between. If we look at the current state of network automation, we can see a confirmation of that &amp;ndash; the &lt;em&gt;majority&lt;/em&gt; of network operators settle on one of the following two options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Everything is done with a DSL (Ansible + Jinja)&lt;/li&gt;
&lt;li&gt;Everything is done with a general-purpose programming language (Python)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similar to the choice of their preferred hardware vendor, the choice between the above two options can be almost religious to some people. There are engineers who wouldn&amp;rsquo;t want to come close to Ansible and there are those who shove all the logic into Ansible DSL, ignoring the exponentially-increasing complexity. The most important point is that both groups seem to have settled on their choice and accepted the caveats and limitations resulting from their decisions. So far, I have not seen any attempts to upset this status quo by exploring and explaining alternative options.&lt;/p&gt;

&lt;p&gt;What if there was another option that would allow us to write truly statically-typed code instead of using type hints, avoid &lt;a href=&#34;https://docs.ansible.com/ansible/latest/user_guide/playbooks_variables.html#understanding-variable-precedence&#34; target=&#34;_blank&#34;&gt;variable override&lt;/a&gt; hell, and get built-in low-cost concurrency and task orchestration? What if we could use a tool that was purposefully built to transform and generate configuration data instead of engaging in Jinja programming with Ansible (which &lt;a href=&#34;https://twitter.com/privateip/status/1174410756181413889&#34; target=&#34;_blank&#34;&gt;was never designed&lt;/a&gt; for this) or trying to write &lt;a href=&#34;https://twitter.com/markdalgleish/status/1554930570844848128&#34; target=&#34;_blank&#34;&gt;error-proof&lt;/a&gt; and &lt;a href=&#34;https://twitter.com/dbarrosop/status/1397161258990903298&#34; target=&#34;_blank&#34;&gt;readable&lt;/a&gt; Python?&lt;/p&gt;

&lt;p&gt;The author of the configuration complexity clock article cautions us against making rash decisions (especially against rolling your own DSL) and also suggests that at a low-enough scale, simpler solutions may be the best option. I would agree with him. If you think that you get enough out of your current automation solution &amp;ndash; you don&amp;rsquo;t feel like you&amp;rsquo;re swimming against the tide all the time and you&amp;rsquo;re confident that when you move on, the next person will be able to pick up and continue your work without rewriting everything from scratch &amp;ndash; then you don&amp;rsquo;t need to change. However, I&amp;rsquo;d like to show you that you can do better. You can create a solution that is faster, more robust to failures and easier to understand and extend. Like anything new, it comes at a price &amp;ndash; learning the language and changing your automation workflows &amp;ndash; but the ultimate benefit may very well be worth it.&lt;/p&gt;

&lt;h2 id=&#34;introducing-cue&#34;&gt;Introducing CUE&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;https://cuelang.org/&#34; target=&#34;_blank&#34;&gt;CUE&lt;/a&gt; or cuelang was built to manage configuration data which, as we&amp;rsquo;ve seen above, is one of the most critical parts of advanced network automation workflows. CUE tries to strike a balance between the simplicity of a DSL and the efficiency of a general-purpose programming language. Visually, it looks very similar to JSON (it is a superset of JSON) with a relaxed grammar, e.g. you can leave comments and omit string quotes for field names. This is an example of a CUE syntax that defines a set of BGP configuration values:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;bgp: {
  asn: 65123
  router_id: &amp;quot;192.0.2.1&amp;quot;
  neighbors: {
    swp51: {
      unnumbered: true
      remote_as:  &amp;quot;external&amp;quot;
    }
    swp52: {
      unnumbered: true
      remote_as:  &amp;quot;external&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The main idea is that you write all your configuration values, constraints and code generation rules in CUE code. It becomes your new source of truth and can later output values in YAML or JSON format, which you can either pass to a text template (e.g. Jinja) to generate a semi-structured configuration or send as-is to a remote device (in case it supports structured input).&lt;/p&gt;

&lt;p&gt;One of the two strongest qualities of CUE for network automation workflows (in my opinion) is &lt;strong&gt;static data typing&lt;/strong&gt;. While we can work with the free-form data defined above, we can easily create a simple schema to ensure that both the shape of the &lt;code&gt;bgp&lt;/code&gt; struct and the type of all its values are exactly what we&amp;rsquo;d expect. Here&amp;rsquo;s the most straightforward way of doing this &amp;ndash; we define another data structure with the same name; CUE will unify the two and validate that the values above are correct:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;import &amp;quot;net&amp;quot;

bgp: {
  asn:       int
  router_id: net.IPv4 &amp;amp; string
  neighbors: [=~&amp;quot;^swp&amp;quot;]: {
    unnumbered: bool | *true
    remote_as:  int | &amp;quot;external&amp;quot; | &amp;quot;internal&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the above example, we mix static typing (&lt;code&gt;asn&lt;/code&gt; value must be an integer) with constraints (&lt;code&gt;router_id&lt;/code&gt; is a string that is also a valid IPv4 address), defaults (default value for &lt;code&gt;unnumbered&lt;/code&gt; is &lt;code&gt;true&lt;/code&gt;) and regex matching (only apply the constraints and defaults to neighbors starting with &lt;code&gt;swp&lt;/code&gt;). Now we can safely add or remove additional types and constraints as our data evolves, relying on CUE to produce the correct configuration values.&lt;/p&gt;
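<p_placeholder></p_placeholder>
&lt;p&gt;Unification also fills in the defaults: even if every &lt;code&gt;unnumbered&lt;/code&gt; field is removed from the data, exporting the unified result (e.g. with &lt;code&gt;cue export --out=yaml&lt;/code&gt;) should still produce something along these lines:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;bgp:
  asn: 65123
  router_id: 192.0.2.1
  neighbors:
    swp51:
      unnumbered: true
      remote_as: external
    swp52:
      unnumbered: true
      remote_as: external
&lt;/code&gt;&lt;/pre&gt;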

&lt;p&gt;Another big selling point of CUE is its powerful &lt;strong&gt;data templating and generation&lt;/strong&gt; capabilities. CUE natively supports value interpolation, conditional fields and &lt;code&gt;for&lt;/code&gt; loops which allow us to generate larger data sets from smaller, more concise inputs. In addition, you can import helper packages from CUE&amp;rsquo;s &lt;a href=&#34;https://cuetorials.com/overview/standard-library/&#34; target=&#34;_blank&#34;&gt;standard library&lt;/a&gt; to perform common data operations. The following contrived example demonstrates the use of field comprehension (the &lt;code&gt;for&lt;/code&gt; loop), local variables (the &lt;code&gt;let&lt;/code&gt; keyword), conditionals and two helper packages from the standard library:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;import (
  &amp;quot;strings&amp;quot;
  &amp;quot;strconv&amp;quot;
)

uplinks: [&amp;quot;swp53&amp;quot;, &amp;quot;swp54&amp;quot;]

bgp: neighbors: {
  for uplink in uplinks {
    let parts = strings.SplitAfter(uplink, &amp;quot;swp&amp;quot;)

    if len(parts) &amp;gt; 1 {
      let intfNum = strconv.ParseInt(parts[1], 10, 32)

      if intfNum &amp;gt;= 50 {
        &amp;quot;\(uplink)&amp;quot;: {
          remote_as: &amp;quot;external&amp;quot;
        }
      }
    }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can find the above code examples in the CUE playground (&lt;a href=&#34;https://cuelang.org/play/?id=Cn_VFwc2oZb#cue@export@cue&#34; target=&#34;_blank&#34;&gt;link&lt;/a&gt;) and experiment by changing the values and observing the result, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change the &lt;code&gt;asn&lt;/code&gt; field to a string instead of an integer&lt;/li&gt;
&lt;li&gt;Try adding a couple of new values to the &lt;code&gt;uplinks&lt;/code&gt; list, e.g. &lt;code&gt;swp50&lt;/code&gt;, &lt;code&gt;swp49&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Change the &lt;code&gt;router_id&lt;/code&gt; field to contain an invalid IPv4 address&lt;/li&gt;
&lt;li&gt;In the drop-down menu at the top of the page, change the output to JSON or YAML&lt;/li&gt;
&lt;/ul&gt;
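
&lt;p&gt;For readers more familiar with general-purpose languages, the comprehension above corresponds to a short imperative loop. Here&amp;rsquo;s a rough Go equivalent (a sketch for illustration only; &lt;code&gt;neighborsFromUplinks&lt;/code&gt; is a hypothetical helper, not part of the CUE workflow):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// neighborsFromUplinks mirrors the CUE comprehension above: for every uplink
// named swpN with N >= 50, emit a neighbor with remote_as set to "external".
func neighborsFromUplinks(uplinks []string) map[string]string {
	neighbors := map[string]string{}
	for _, uplink := range uplinks {
		num, err := strconv.Atoi(strings.TrimPrefix(uplink, "swp"))
		if err != nil {
			continue // not an swpN interface name
		}
		if num >= 50 {
			neighbors[uplink] = "external"
		}
	}
	return neighbors
}

func main() {
	fmt.Println(neighborsFromUplinks([]string{"swp53", "swp54", "swp49", "eth0"}))
}
```

&lt;p&gt;The CUE version achieves the same result declaratively, and the generated fields are still subject to the schema defined earlier.&lt;/p&gt;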

&lt;p&gt;With the examples above, we&amp;rsquo;re just scratching the surface of what CUE is capable of. Things I haven&amp;rsquo;t covered here include module packaging, integration with OpenAPI, YAML, JSON and Go, and the built-in support for external network calls. The goal of the current article is mainly to whet your appetite, but I&amp;rsquo;ll try to cover these and other interesting features in the following blog posts. Here&amp;rsquo;s what you can expect to find in the upcoming material:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Augmenting an existing Ansible-based automation workflow with CUE&lt;/li&gt;
&lt;li&gt;Using CUE for YANG-based APIs&lt;/li&gt;
&lt;li&gt;Orchestrating API interactions with remote devices&lt;/li&gt;
&lt;li&gt;Reducing configuration boilerplate&lt;/li&gt;
&lt;li&gt;Comparing the performance of CUE and Ansible&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Containerising NVIDIA Cumulus Linux</title>
      <link>https://networkop.co.uk/post/2021-05-cumulus-ignite/</link>
      <pubDate>Tue, 25 May 2021 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2021-05-cumulus-ignite/</guid>
      <description>

&lt;p&gt;In one of his &lt;a href=&#34;https://blog.ipspace.net/2021/04/katacoda-netsim-containerlab-frr.html?utm_source=atom_feed&#34; target=&#34;_blank&#34;&gt;recent posts&lt;/a&gt;, Ivan raises a question: &amp;ldquo;I can’t grasp why Cumulus releases a Vagrant box, but not a Docker container&amp;rdquo;. Coincidentally, only a few weeks before that I had &lt;a href=&#34;https://twitter.com/networkop1/status/1384175045950414848&#34; target=&#34;_blank&#34;&gt;managed&lt;/a&gt; to create a Cumulus Linux container image. Since then, I&amp;rsquo;ve done a lot of testing, discovered the limitations of the pure containerised approach and worked out how to overcome them while still retaining the container user experience. This post documents my journey from the early days of running Cumulus on Docker to the integration with containerlab and, finally, running Cumulus in microVMs backed by AWS&amp;rsquo;s Firecracker and Weaveworks&amp;rsquo; Ignite.&lt;/p&gt;

&lt;h2 id=&#34;innovation-trigger&#34;&gt;Innovation Trigger&lt;/h2&gt;

&lt;p&gt;One of the main reasons for running containerised infrastructure is the famous Docker UX. Containers existed for a very long time, but they only became mainstream when Docker released its container engine. The simplicity of a typical Docker workflow (build, ship, run) made it accessible to a large number of not-so-technical users and was the key to its popularity.&lt;/p&gt;

&lt;p&gt;Virtualised infrastructure, including network operating systems, has mainly been distributed in a VM form-factor, retaining much of the look and feel of the real hardware for the software processes running on top. However, that didn&amp;rsquo;t stop people from looking for a better and easier way to run and test it; some of the smartest people in the industry are always &lt;a href=&#34;https://twitter.com/ibuildthecloud/status/1362162684637061121&#34; target=&#34;_blank&#34;&gt;looking&lt;/a&gt; for an alternative to the traditional Libvirt/Vagrant experience.&lt;/p&gt;

&lt;p&gt;While VM tooling has been pretty much stagnant for the last decade (think Vagrant), containers have amassed a huge ecosystem of tools and an active community around them. Specifically in the networking area, in the last few years we&amp;rsquo;ve seen commercial companies like &lt;a href=&#34;https://www.fastly.com/press/press-releases/fastly-achieves-100-tbps-edge-capacity-milestone&#34; target=&#34;_blank&#34;&gt;Tesuto&lt;/a&gt; and multiple open-source projects like &lt;a href=&#34;https://github.com/plajjan/vrnetlab&#34; target=&#34;_blank&#34;&gt;vrnetlab&lt;/a&gt;, &lt;a href=&#34;https://github.com/networkop/docker-topo&#34; target=&#34;_blank&#34;&gt;docker-topo&lt;/a&gt;, &lt;a href=&#34;https://github.com/networkop/k8s-topo&#34; target=&#34;_blank&#34;&gt;k8s-topo&lt;/a&gt; and, most recently, &lt;a href=&#34;https://containerlab.srlinux.dev/&#34; target=&#34;_blank&#34;&gt;containerlab&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So when I joined Nvidia in April 2021, I thought it&amp;rsquo;d be a fun experiment for me to try to containerise Cumulus Linux and learn how the operating system works in the process.&lt;/p&gt;

&lt;h2 id=&#34;peak-of-inflated-expectations&#34;&gt;Peak of Inflated Expectations&lt;/h2&gt;

&lt;p&gt;Building a container image was the first and, as it turned out, the easiest problem to solve. Thanks to the Debian-based architecture of Cumulus Linux, I was able to build a complete container image with just a few lines:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-Dockerfile&#34;&gt;FROM debian:buster

COPY data/packages packages
COPY data/sources.list /etc/apt/sources.list
COPY data/trusted.gpg /etc/apt/trusted.gpg
RUN apt update &amp;amp;&amp;amp; apt install --allow-downgrades -y $(cat packages)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I extracted the list of installed packages and public APT repos from an existing Cumulus VX VM, copied them into a base &lt;code&gt;debian:buster&lt;/code&gt; image and ran &lt;code&gt;apt install&lt;/code&gt; &amp;ndash; that&amp;rsquo;s how easy it was. Obviously, the &lt;a href=&#34;https://github.com/networkop/cx/blob/main/Dockerfile&#34; target=&#34;_blank&#34;&gt;actual Dockerfile&lt;/a&gt; ended up being a lot longer, but the main work is done in just these 5 lines. The rest of the steps just set up the required 3rd-party packages and implement various workarounds and hacks. Below is a simplified view of the resulting Cumulus image:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/cumulus-cx.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Once the image is built, it can be run with just a single command. Note the presence of the &lt;code&gt;privileged&lt;/code&gt; flag, which is the easiest way to run systemd and provide NET_ADMIN and other capabilities required by Cumulus daemons:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;docker run -d --name cumulus --privileged networkop/cx:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A few seconds later, the entire Cumulus software stack is fully initialised and ready for action. Users can either start an interactive session or run ad-hoc commands to communicate with Cumulus daemons:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ docker exec cumulus net show system
Hostname......... 5b870d5c3d31
Build............ Cumulus Linux 4.3.0
Uptime........... 13 days, 5:03:30.690000

Model............ Cumulus VX
Memory........... 12GB
Disk............. 256GB
Vendor Name...... Cumulus Networks
Part Number...... 4.3.0
Base MAC Address. 02:42:C0:A8:DF:02
Serial Number.... 02:42:C0:A8:DF:02
Product Name..... Containerised VX
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;All this seemed pretty cool, but I still had doubts over the functionality of the Cumulus dataplane on a general-purpose kernel. Most traditional networking vendors do not rely on the native kernel dataplane and heavily modify or bypass it completely in order to implement all of the required NOS features. My secret hope was that Cumulus, being the Linux-native NOS, would somehow make it work with just a standard set of kernel features. The only way to find out was to test.&lt;/p&gt;

&lt;h2 id=&#34;building-a-test-lab&#34;&gt;Building a test lab&lt;/h2&gt;

&lt;p&gt;I decided that the best way to test was to re-implement the &lt;a href=&#34;https://gitlab.com/cumulus-consulting/goldenturtle/cldemo2&#34; target=&#34;_blank&#34;&gt;Cumulus Test Drive&lt;/a&gt; environment and make use of the Ansible playbooks that come with it. Here&amp;rsquo;s a short snippet of containerlab&amp;rsquo;s topology definition matching the CTD&amp;rsquo;s &lt;a href=&#34;https://gitlab.com/cumulus-consulting/goldenturtle/cldemo2/-/blob/master/documentation/diagrams/cldemo-pod.png&#34; target=&#34;_blank&#34;&gt;topology&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;name: cldemo2-mini

topology:
  nodes:
    leaf01:
      kind: linux
      image: networkop/cx:4.3.0
    leaf02:
      kind: linux
      image: networkop/cx:4.3.0
...

  links:
    - endpoints: [&amp;quot;leaf01:swp1&amp;quot;, &amp;quot;server01:eth1&amp;quot;]
    - endpoints: [&amp;quot;leaf01:swp2&amp;quot;, &amp;quot;server02:eth1&amp;quot;]
    - endpoints: [&amp;quot;leaf01:swp3&amp;quot;, &amp;quot;server03:eth1&amp;quot;]
    - endpoints: [&amp;quot;leaf02:swp1&amp;quot;, &amp;quot;server01:eth2&amp;quot;]
    - endpoints: [&amp;quot;leaf02:swp2&amp;quot;, &amp;quot;server02:eth2&amp;quot;]
    - endpoints: [&amp;quot;leaf02:swp3&amp;quot;, &amp;quot;server03:eth2&amp;quot;]
    - endpoints: [&amp;quot;leaf01:swp49&amp;quot;, &amp;quot;leaf02:swp49&amp;quot;]
    - endpoints: [&amp;quot;leaf01:swp50&amp;quot;, &amp;quot;leaf02:swp50&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The entire lab can be spun up with a single command in under 20 seconds (on a 10th gen i7 in WSL2):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ sudo containerlab deploy -t cldemo2.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At the end of the &lt;code&gt;deploy&lt;/code&gt; action, containerlab generates an Ansible inventory file which, with a few minor modifications, can be re-used with the Cumulus Ansible &lt;a href=&#34;https://gitlab.com/cumulus-consulting/goldenturtle/cumulus_ansible_modules&#34; target=&#34;_blank&#34;&gt;modules&lt;/a&gt;. At this stage, I was able to test any of the 4 available EVPN-based &lt;a href=&#34;https://gitlab.com/cumulus-consulting/goldenturtle/cumulus_ansible_modules#how-to-use&#34; target=&#34;_blank&#34;&gt;designs&lt;/a&gt; and swap them around with just a few commands, and it had all taken me just a few hours to build. This is where my luck ran out&amp;hellip;&lt;/p&gt;

&lt;h2 id=&#34;the-trough-of-disillusionment&#34;&gt;The Trough of Disillusionment&lt;/h2&gt;

&lt;p&gt;The first few topologies I spun up and tested worked pretty well out of the box; however, I did notice that my fans were spinning like crazy. Upon further examination, I noticed that &lt;code&gt;clagd&lt;/code&gt; (the MLAG daemon) and &lt;code&gt;neighmgrd&lt;/code&gt; (the ARP watchdog) were intermittently fighting to take over all available CPU threads while nothing was showing up in the logs. That&amp;rsquo;s when I decided to have a look at the peerlink; thankfully, it was super easy to run &lt;code&gt;ip netns exec FOO tcpdump&lt;/code&gt; from my WSL2 VM. When I saw hundreds of lines flying across my screen in the next few seconds, I realised it was an L2 loop (it turned out all of the packets were ARP).&lt;/p&gt;

&lt;p&gt;At this point, it is worth mentioning that one of the hacks/workarounds I had to implement when building the image was stubbing out &lt;code&gt;mstpd&lt;/code&gt; (it wasn&amp;rsquo;t able to take over the bridge&amp;rsquo;s STP control plane). At first, I didn&amp;rsquo;t think too much of it &amp;ndash; the kernel was still running CSTP and the speed of convergence wasn&amp;rsquo;t that big of an issue for me. However, as I was digging deeper, I realised that &lt;code&gt;clagd&lt;/code&gt; must be communicating with &lt;code&gt;mstpd&lt;/code&gt; in order to control the state of the peerlink VLAN interfaces (traffic is never forwarded over the peerlink under normal conditions). That fact alone meant that neither the standard kernel STP implementation nor the &lt;a href=&#34;https://github.com/mstpd/mstpd&#34; target=&#34;_blank&#34;&gt;upstream mstpd&lt;/a&gt; would ever be able to cooperate with &lt;code&gt;clagd&lt;/code&gt; &amp;ndash; there&amp;rsquo;s no standard for MLAG (although I suspect most implementations are written by the same set of people). My heart sank; at this stage, I was ready to give up and admit that there was no way one of the most widely deployed features (MLAG) would work inside a container.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It turned out that CL&amp;rsquo;s version of &lt;code&gt;mstpd&lt;/code&gt; is different from the one upstream and relies on a custom &lt;code&gt;bridge&lt;/code&gt; kernel module in order to function properly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, there was &lt;em&gt;one&lt;/em&gt; way to make Cumulus Linux work in a containerised environment: running it over a native Cumulus kernel which, as I discovered later, was very &lt;a href=&#34;http://oss.cumulusnetworks.com/CumulusLinux-2.5.1/patches/kernel/&#34; target=&#34;_blank&#34;&gt;heavily patched&lt;/a&gt;. So, in theory, I could run tests on a beefy Cumulus VX VM with all services but Docker turned off, but that would be a big ask and not the nice UX I was hoping for&amp;hellip;&lt;/p&gt;

&lt;h2 id=&#34;slope-of-enlightenment&#34;&gt;Slope of Enlightenment&lt;/h2&gt;

&lt;p&gt;This is when I thought about &lt;a href=&#34;https://firecracker-microvm.github.io/&#34; target=&#34;_blank&#34;&gt;Firecracker&lt;/a&gt; &amp;ndash; the lightweight VM manager released by AWS to run Lambda and Fargate services (&lt;a href=&#34;https://github.com/firecracker-microvm/firecracker/blob/main/CREDITS.md&#34; target=&#34;_blank&#34;&gt;originally&lt;/a&gt; based on the work of the Chromium OS team). I&amp;rsquo;d started looking at the potential candidates for FC VM orchestration and got very excited when I saw both &lt;a href=&#34;https://github.com/firecracker-microvm/firecracker-containerd/blob/f320d3636aee41661eb525b284ce6213f6c7a3d5/docs/networking.md&#34; target=&#34;_blank&#34;&gt;firecracker-containerd&lt;/a&gt; and &lt;a href=&#34;https://github.com/kata-containers/kata-containers/blob/2fc7f75724ac9e18e60f63dcc9aa395dc51c184d/docs/design/architecture.md#networking&#34; target=&#34;_blank&#34;&gt;kata-containers&lt;/a&gt; support multiple network interfaces with &lt;a href=&#34;https://man7.org/linux/man-pages/man8/tc-mirred.8.html&#34; target=&#34;_blank&#34;&gt;tc redirect&lt;/a&gt;, the same technology that&amp;rsquo;s used by containerlab to run &lt;a href=&#34;https://containerlab.srlinux.dev/manual/vrnetlab/&#34; target=&#34;_blank&#34;&gt;vrnetlab-based images&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, both of these candidates relied on &lt;a href=&#34;https://lwn.net/Articles/556550/&#34; target=&#34;_blank&#34;&gt;virtio VM Sockets&lt;/a&gt; as the communication channel with a VM, which just happened to be one of the features &lt;em&gt;disabled&lt;/em&gt; in the Cumulus Linux kernel. So the next option I looked at was Weaveworks&amp;rsquo; &lt;a href=&#34;https://github.com/weaveworks/ignite&#34; target=&#34;_blank&#34;&gt;Ignite&lt;/a&gt; and, to my surprise, it worked! I was able to boot the same container image using the Ignite CLI instead of Docker:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;sudo ignite run --runtime docker --name test --kernel-image networkop/kernel networkop/cx
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The kernel image is built from two layers borrowed from an existing Cumulus VX VM &amp;ndash; an uncompressed kernel image and the entire &lt;code&gt;/lib/modules&lt;/code&gt; directory containing loadable kernel modules. The resulting image layer stack looked like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/cumulus-fc.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
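&lt;p&gt;For reference, an image with this layout can be assembled with a minimal Dockerfile along these lines &amp;ndash; a sketch only, where the exact source paths are assumptions based on Ignite&amp;rsquo;s convention of expecting an uncompressed kernel at &lt;code&gt;/boot/vmlinux&lt;/code&gt;:&lt;/p&gt;

```Dockerfile
# Sketch: vmlinux and modules/ are assumed to be extracted
# from a Cumulus VX VM beforehand
FROM scratch

# Uncompressed kernel binary, in the location where Ignite looks for it
COPY vmlinux /boot/vmlinux

# Loadable kernel modules matching the kernel version above
COPY modules/ /lib/modules/
```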

&lt;p&gt;Finally, I was able to test and confirm that all of the features I had worked around &amp;ndash; the ones that didn&amp;rsquo;t work in a pure container environment &amp;ndash; worked with Ignite. This was a promising first step, but there were still a number of key features missing in both containerlab and Ignite that needed to be addressed next:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In order to gracefully introduce ignite, containerlab&amp;rsquo;s code had to be refactored to support multiple container runtimes [&lt;a href=&#34;https://github.com/srl-labs/containerlab/pull/416&#34; target=&#34;_blank&#34;&gt;DONE&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;In order to support custom interface naming, containerlab had to control the assignment of interface MAC addresses [&lt;a href=&#34;https://github.com/srl-labs/containerlab/pull/422&#34; target=&#34;_blank&#34;&gt;DONE&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;Ignite needed to be extended to support multiple interfaces and stitch them with tc redirect [&lt;a href=&#34;https://github.com/weaveworks/ignite/pull/836&#34; target=&#34;_blank&#34;&gt;PR is merged&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;A new &lt;code&gt;ignite&lt;/code&gt; runtime had to be added to containerlab [&lt;a href=&#34;https://containerlab.srlinux.dev/rn/0.15/&#34; target=&#34;_blank&#34;&gt;DONE&lt;/a&gt;]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One obvious question could be &amp;ndash; is any of this worth the effort? Personally, I learned so much in the process that it was well worth it. For others, I have tried to summarise in the table below some of the main reasons to use containerised Firecracker VMs instead of traditional QEMU-based VMs:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Legacy VMs&lt;/th&gt;
&lt;th&gt;Ignite VMs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;

&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UX&lt;/td&gt;
&lt;td&gt;Complex &amp;ndash; Vagrant, Libvirt&lt;/td&gt;
&lt;td&gt;Simple &amp;ndash; containerlab, ignite&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;Legacy, &lt;a href=&#34;https://github.com/qemu/qemu/blob/master/docs/interop/qmp-spec.txt&#34; target=&#34;_blank&#34;&gt;QMP&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Modern, &lt;a href=&#34;https://github.com/firecracker-microvm/firecracker/blob/main/src/api_server/swagger/firecracker.yaml&#34; target=&#34;_blank&#34;&gt;OpenAPI&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;Images&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://docs.openstack.org/image-guide/convert-images.html&#34; target=&#34;_blank&#34;&gt;Multiple formats&lt;/a&gt;, mutable&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://github.com/opencontainers/image-spec&#34; target=&#34;_blank&#34;&gt;OCI-standard&lt;/a&gt;, immutable&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;Startup configuration&lt;/td&gt;
&lt;td&gt;Ansible, interactive&lt;/td&gt;
&lt;td&gt;Mounting files from host OS&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;Distribution&lt;/td&gt;
&lt;td&gt;Individual file sharing&lt;/td&gt;
&lt;td&gt;Container registries&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;Startup time&lt;/td&gt;
&lt;td&gt;Tens of seconds&lt;/td&gt;
&lt;td&gt;Seconds&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;Scale-out&lt;/td&gt;
&lt;td&gt;Complex and &lt;a href=&#34;https://www.vagrantup.com/docs/multi-machine&#34; target=&#34;_blank&#34;&gt;static&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Standard and &lt;a href=&#34;https://github.com/networkop/k8s-topo&#34; target=&#34;_blank&#34;&gt;dynamic&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;In addition to this, Firecracker&amp;rsquo;s official website provides a list of &lt;a href=&#34;https://firecracker-microvm.github.io/#benefits&#34; target=&#34;_blank&#34;&gt;benefits&lt;/a&gt; and &lt;a href=&#34;https://firecracker-microvm.github.io/#faq&#34; target=&#34;_blank&#34;&gt;FAQ&lt;/a&gt; covering some of the differences with QEMU.&lt;/p&gt;

&lt;h2 id=&#34;plateau-of-productivity&#34;&gt;Plateau of Productivity&lt;/h2&gt;

&lt;p&gt;Although the final stage is still a fair way out, the good news is that I have a stable working prototype that can reliably build Cumulus-based labs so, hopefully, it&amp;rsquo;s only a matter of time before all of the PRs get merged and this functionality becomes available upstream. I also hope this work demonstrates the possibility for other NOSs to ship their virtualised versions as OCI images bundled together with their custom kernels.&lt;/p&gt;

&lt;p&gt;In the meantime, if you&amp;rsquo;re interested, feel free to reach out to me and I&amp;rsquo;ll try to help you get started using containerised Cumulus Linux both on a single node with containerlab and, potentially, even use it for large-scale simulations on top of Kubernetes.&lt;/p&gt;

&lt;h2 id=&#34;july-updates&#34;&gt;July Updates&lt;/h2&gt;

&lt;p&gt;Although it took me a lot longer than I anticipated, I&amp;rsquo;ve managed to merge all of my changes upstream:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ignite &lt;a href=&#34;https://github.com/weaveworks/ignite/pull/836&#34; target=&#34;_blank&#34;&gt;now supports&lt;/a&gt; connecting an arbitrary number of extra interfaces defined in the VM&amp;rsquo;s &lt;a href=&#34;https://github.com/weaveworks/ignite/blob/main/pkg/constants/vm.go#L45&#34; target=&#34;_blank&#34;&gt;annotations&lt;/a&gt;. This opens up possibilities beyond the original network simulation use case, allowing Firecracker micro-VMs to transparently interconnect with any interfaces on the host (e.g. via SR-IOV CNI).&lt;/li&gt;
&lt;li&gt;Containerlab release &lt;a href=&#34;https://containerlab.srlinux.dev/rn/0.15/&#34; target=&#34;_blank&#34;&gt;0.15&lt;/a&gt; now includes a special &lt;code&gt;cvx&lt;/code&gt; node that spins up a containerised Cumulus Linux which can be integrated with any number of the &lt;a href=&#34;https://containerlab.srlinux.dev/manual/kinds/kinds/&#34; target=&#34;_blank&#34;&gt;supported nodes&lt;/a&gt; for multi-vendor labs and interop testing. I&amp;rsquo;ve also included a number of labs with different configurations covering everything from the basics of Cumulus Linux operation (&lt;a href=&#34;https://clabs.netdevops.me/rs/cvx03/&#34; target=&#34;_blank&#34;&gt;CTD&lt;/a&gt;) all the way to &lt;a href=&#34;https://clabs.netdevops.me/rs/cvx04/&#34; target=&#34;_blank&#34;&gt;advanced scenarios&lt;/a&gt; like symmetric EVPN with MLAG and MLAG-free multi-homing.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Getting Started with eBPF and Go</title>
      <link>https://networkop.co.uk/post/2021-03-ebpf-intro/</link>
      <pubDate>Mon, 08 Mar 2021 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2021-03-ebpf-intro/</guid>
      <description>

&lt;p&gt;eBPF has a thriving ecosystem with a plethora of educational resources both on the subject of &lt;a href=&#34;https://ebpf.io/what-is-ebpf/&#34; target=&#34;_blank&#34;&gt;eBPF itself&lt;/a&gt; and its various applications, including &lt;a href=&#34;https://github.com/xdp-project/xdp-tutorial&#34; target=&#34;_blank&#34;&gt;XDP&lt;/a&gt;. Where it becomes confusing is when it comes to the choice of libraries and tools to interact with and orchestrate eBPF. Here you have to choose between the Python-based &lt;a href=&#34;https://github.com/iovisor/bcc&#34; target=&#34;_blank&#34;&gt;BCC&lt;/a&gt; framework, the C-based &lt;a href=&#34;https://github.com/libbpf/libbpf&#34; target=&#34;_blank&#34;&gt;libbpf&lt;/a&gt; and a range of Go-based libraries from &lt;a href=&#34;https://github.com/dropbox/goebpf&#34; target=&#34;_blank&#34;&gt;Dropbox&lt;/a&gt;, &lt;a href=&#34;https://github.com/cilium/ebpf&#34; target=&#34;_blank&#34;&gt;Cilium&lt;/a&gt;, &lt;a href=&#34;https://github.com/aquasecurity/tracee/tree/main/libbpfgo&#34; target=&#34;_blank&#34;&gt;Aqua&lt;/a&gt; and &lt;a href=&#34;https://github.com/projectcalico/felix/tree/master/bpf&#34; target=&#34;_blank&#34;&gt;Calico&lt;/a&gt;. Another important area that is often overlooked is the &amp;ldquo;productionisation&amp;rdquo; of eBPF code, i.e. going from manually instrumented examples towards production-grade applications like Cilium.
In this post, I&amp;rsquo;ll document some of my findings in this space, specifically in the context of writing a network (XDP) application with a userspace controller written in Go.&lt;/p&gt;

&lt;h2 id=&#34;choosing-an-ebpf-library&#34;&gt;Choosing an eBPF library&lt;/h2&gt;

&lt;p&gt;In most cases, an eBPF library is there to help you achieve two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Load eBPF programs and maps&lt;/strong&gt; into the kernel and perform &lt;a href=&#34;https://kinvolk.io/blog/2018/10/exploring-bpf-elf-loaders-at-the-bpf-hackfest/#common-steps&#34; target=&#34;_blank&#34;&gt;relocations&lt;/a&gt;, associating an eBPF program with the correct map via its file descriptor.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interact with eBPF maps&lt;/strong&gt;, allowing all the standard CRUD operations on the key/value pairs stored in those maps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some libraries may also help you attach your eBPF program to a specific &lt;a href=&#34;https://ebpf.io/what-is-ebpf/#hook-overview&#34; target=&#34;_blank&#34;&gt;hook&lt;/a&gt;, although for networking use cases this can easily be done with any existing netlink API library.&lt;/p&gt;

&lt;p&gt;When it comes to the choice of an eBPF library, I&amp;rsquo;m not the only one confused (see &lt;a href=&#34;https://twitter.com/maurovasquezb/status/1146438190062063616&#34; target=&#34;_blank&#34;&gt;[1]&lt;/a&gt;,&lt;a href=&#34;https://twitter.com/qeole/status/1364521385138282497&#34; target=&#34;_blank&#34;&gt;[2]&lt;/a&gt;). The truth is each library has its own unique scope and limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://pkg.go.dev/github.com/projectcalico/felix@v3.8.9+incompatible/bpf&#34; target=&#34;_blank&#34;&gt;Calico&lt;/a&gt; implements a Go wrapper around CLI commands made with &lt;a href=&#34;https://twitter.com/qeole/status/1101450782841466880&#34; target=&#34;_blank&#34;&gt;bpftool&lt;/a&gt; and iproute2.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/aquasecurity/tracee/tree/main/libbpfgo&#34; target=&#34;_blank&#34;&gt;Aqua&lt;/a&gt; implements a Go wrapper around libbpf C library.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/dropbox/goebpf&#34; target=&#34;_blank&#34;&gt;Dropbox&lt;/a&gt; supports a small set of programs but has a very clean and convenient user API.&lt;/li&gt;
&lt;li&gt;IO Visor&amp;rsquo;s &lt;a href=&#34;https://github.com/iovisor/gobpf&#34; target=&#34;_blank&#34;&gt;gobpf&lt;/a&gt; is a collection of go bindings for the BCC framework which has a stronger focus on tracing and profiling.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/cilium/ebpf&#34; target=&#34;_blank&#34;&gt;Cilium and Cloudflare&lt;/a&gt; are maintaining a &lt;a href=&#34;https://linuxplumbersconf.org/event/4/contributions/449/attachments/239/529/A_pure_Go_eBPF_library.pdf&#34; target=&#34;_blank&#34;&gt;pure Go library&lt;/a&gt; (referred to below as &lt;code&gt;libbpf-go&lt;/code&gt;) that abstracts all eBPF syscalls behind a native Go interface.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For my network-specific use case, I ended up using &lt;code&gt;libbpf-go&lt;/code&gt; because it&amp;rsquo;s used by Cilium and Cloudflare and has an active community, although I really liked the simplicity of the one from Dropbox and could have used it as well.&lt;/p&gt;

&lt;p&gt;In order to familiarise myself with the development process, I&amp;rsquo;ve decided to implement an XDP cross-connect application, which has a very niche but important &lt;a href=&#34;https://netdevops.me/2021/transparently-redirecting-packets/frames-between-interfaces/&#34; target=&#34;_blank&#34;&gt;use case&lt;/a&gt; in network topology emulation. The goal is to have an application that watches a configuration file and ensures that local interfaces are interconnected according to the YAML spec from that file. Here is a high-level overview of how &lt;a href=&#34;https://github.com/networkop/xdp-xconnect&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;xdp-xconnect&lt;/code&gt;&lt;/a&gt; works:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/xdp-xconnect.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;The following sections will describe the application build and delivery process step-by-step, focusing more on integration and less on the actual code. Full code for &lt;code&gt;xdp-xconnect&lt;/code&gt; is &lt;a href=&#34;https://github.com/networkop/xdp-xconnect&#34; target=&#34;_blank&#34;&gt;available&lt;/a&gt; on Github.&lt;/p&gt;

&lt;h2 id=&#34;step-1-writing-the-ebpf-code&#34;&gt;Step 1 - Writing the eBPF code&lt;/h2&gt;

&lt;p&gt;Normally this would be the main section of any &amp;ldquo;Getting Started with eBPF&amp;rdquo; article, but this time it&amp;rsquo;s not the focus. I don&amp;rsquo;t think I can help others learn how to write eBPF; however, I can refer to some very good resources that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generic eBPF theory is covered in a lot of detail on &lt;a href=&#34;https://ebpf.io/what-is-ebpf/&#34; target=&#34;_blank&#34;&gt;ebpf.io&lt;/a&gt; and Cilium&amp;rsquo;s eBPF and XDP &lt;a href=&#34;https://docs.cilium.io/en/stable/bpf/&#34; target=&#34;_blank&#34;&gt;reference guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The best place for some hands-on practice with eBPF and XDP is the &lt;a href=&#34;https://github.com/xdp-project/xdp-tutorial&#34; target=&#34;_blank&#34;&gt;xdp-tutorial&lt;/a&gt;. It&amp;rsquo;s an amazing resource that is definitely worth reading even if you don&amp;rsquo;t end up doing the assignments.&lt;/li&gt;
&lt;li&gt;Cilium &lt;a href=&#34;https://github.com/cilium/cilium/tree/master/bpf&#34; target=&#34;_blank&#34;&gt;source code&lt;/a&gt; and its analysis in &lt;a href=&#34;https://k8s.networkop.co.uk/cni/cilium/#a-day-in-the-life-of-a-packet&#34; target=&#34;_blank&#34;&gt;[1]&lt;/a&gt; and &lt;a href=&#34;http://arthurchiao.art/blog/cilium-life-of-a-packet-pod-to-service/&#34; target=&#34;_blank&#34;&gt;[2]&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My eBPF program is very simple: it consists of a single call to an eBPF &lt;a href=&#34;https://man7.org/linux/man-pages/man7/bpf-helpers.7.html&#34; target=&#34;_blank&#34;&gt;helper function&lt;/a&gt;, which redirects &lt;em&gt;all&lt;/em&gt; packets from one interface to another based on the index of the incoming interface.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c&#34;&gt;#include &amp;lt;linux/bpf.h&amp;gt;
#include &amp;lt;bpf/bpf_helpers.h&amp;gt;

/* devmap populated from userspace: ingress ifindex -&amp;gt; egress ifindex */
struct bpf_map_def SEC(&amp;quot;maps&amp;quot;) xconnect_map = {
    .type        = BPF_MAP_TYPE_DEVMAP,
    .key_size    = sizeof(int),
    .value_size  = sizeof(int),
    .max_entries = 512,
};

SEC(&amp;quot;xdp&amp;quot;)
int xdp_xconnect(struct xdp_md *ctx)
{
    return bpf_redirect_map(&amp;amp;xconnect_map, ctx-&amp;gt;ingress_ifindex, 0);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In order to compile the above program, we need to provide search paths for all the included header files. The easiest way to do that is to make a copy of everything under &lt;a href=&#34;https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/tools/lib/bpf&#34; target=&#34;_blank&#34;&gt;linux/tools/lib/bpf/&lt;/a&gt;; however, this would include a lot of unnecessary files, so an alternative is to create a list of dependencies:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ clang -MD -MF xconnect.d -target bpf -I ~/linux/tools/lib/bpf -c xconnect.c
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we can make a local copy of only a small number of files specified in &lt;code&gt;xconnect.d&lt;/code&gt; and use the following command to compile eBPF code for the local CPU architecture:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ clang -target bpf -Wall -O2 -emit-llvm -g -Iinclude -c xconnect.c -o - | \
llc -march=bpf -mcpu=probe -filetype=obj -o xconnect.o
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The resulting ELF file is what we&amp;rsquo;d need to provide to our Go library in the next step.&lt;/p&gt;

&lt;h2 id=&#34;step-2-writing-the-go-code&#34;&gt;Step 2 - Writing the Go code&lt;/h2&gt;

&lt;p&gt;Compiled eBPF programs and maps can be loaded by &lt;code&gt;libbpf-go&lt;/code&gt; with just a few instructions. By adding a struct with &lt;code&gt;ebpf&lt;/code&gt; tags we can automate the relocation procedure so that our program knows where to find its map.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;spec, err := ebpf.LoadCollectionSpec(&amp;quot;ebpf/xconnect.o&amp;quot;)
if err != nil {
  panic(err)
}

var objs struct {
	XCProg  *ebpf.Program `ebpf:&amp;quot;xdp_xconnect&amp;quot;`
	XCMap   *ebpf.Map     `ebpf:&amp;quot;xconnect_map&amp;quot;`
}
if err := spec.LoadAndAssign(&amp;amp;objs, nil); err != nil {
	panic(err)
}
defer objs.XCProg.Close()
defer objs.XCMap.Close()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Type &lt;code&gt;ebpf.Map&lt;/code&gt; has a set of methods that perform standard CRUD operations on the contents of the loaded map:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;err = objs.XCMap.Put(uint32(0), uint32(10))

var v0 uint32
err = objs.XCMap.Lookup(uint32(0), &amp;amp;v0)

err = objs.XCMap.Delete(uint32(0))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The only step that&amp;rsquo;s not covered by &lt;code&gt;libbpf-go&lt;/code&gt; is the attachment of programs to network hooks. This, however, can easily be accomplished by any existing netlink library, e.g. &lt;a href=&#34;https://github.com/vishvananda/netlink&#34; target=&#34;_blank&#34;&gt;vishvananda/netlink&lt;/a&gt;, by associating a network link with a file descriptor of the loaded program:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;link, err := netlink.LinkByName(&amp;quot;eth0&amp;quot;)
err = netlink.LinkSetXdpFdWithFlags(link, objs.XCProg.FD(), 2) // 2 == XDP_FLAGS_SKB_MODE
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that I&amp;rsquo;m using the &lt;a href=&#34;https://github.com/torvalds/linux/blob/master/tools/include/uapi/linux/if_link.h#L966&#34; target=&#34;_blank&#34;&gt;SKB_MODE&lt;/a&gt; XDP flag to work around the existing veth driver &lt;a href=&#34;https://github.com/xdp-project/xdp-tutorial/tree/master/packet03-redirecting#sending-packets-back-to-the-interface-they-came-from&#34; target=&#34;_blank&#34;&gt;caveat&lt;/a&gt;. Although the native XDP mode is &lt;a href=&#34;https://www.netronome.com/media/images/fig3.width-800.png&#34; target=&#34;_blank&#34;&gt;considerably faster&lt;/a&gt; than any other eBPF hook, SKB_MODE may not be as fast because packet headers have to be pre-parsed by the network stack (see &lt;a href=&#34;https://www.youtube.com/watch?v=q3gjNe6LKDI&#34; target=&#34;_blank&#34;&gt;video&lt;/a&gt;).&lt;/p&gt;
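&lt;p&gt;For reference, the magic number &lt;code&gt;2&lt;/code&gt; passed to the netlink call above corresponds to &lt;code&gt;XDP_FLAGS_SKB_MODE&lt;/code&gt; from &lt;code&gt;linux/if_link.h&lt;/code&gt;. A minimal sketch of the attach-mode flag values (the Go constant names are mine; only the values come from the kernel header):&lt;/p&gt;

```go
package main

import "fmt"

// XDP attach-mode flags as defined in linux/if_link.h; the constant names
// below are illustrative, only the values come from the kernel header.
const (
	XdpFlagsSkbMode = 1 << 1 // generic "SKB" mode, works with any driver
	XdpFlagsDrvMode = 1 << 2 // native driver mode
	XdpFlagsHwMode  = 1 << 3 // offload to supporting NICs
)

func main() {
	fmt.Println(XdpFlagsSkbMode) // the flag value used in the snippet above
}
```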

&lt;h2 id=&#34;step-3-code-distribution&#34;&gt;Step 3 - Code Distribution&lt;/h2&gt;

&lt;p&gt;At this point everything should have been ready to package and ship our application if it wasn&amp;rsquo;t for one problem - eBPF &lt;a href=&#34;https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#the-problem-of-bpf-portability&#34; target=&#34;_blank&#34;&gt;code portability&lt;/a&gt;. Historically, this process involved copying the eBPF source code to the target platform, pulling in the required kernel headers and compiling it for the specific kernel version. This problem is especially pronounced for tracing/monitoring/profiling use cases which may require access to pretty much any kernel data structure, so the only solution is to introduce another layer of indirection (see &lt;a href=&#34;https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html&#34; target=&#34;_blank&#34;&gt;CO-RE&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Network use cases, on the other hand, rely on a relatively small and stable subset of kernel types, so they don&amp;rsquo;t suffer from the same kind of problems as their tracing and profiling counterparts. Based on what I&amp;rsquo;ve seen so far, the two most common code packaging approaches are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ship eBPF code together with the required kernel headers, assuming they match the underlying kernel (see &lt;a href=&#34;https://github.com/cilium/cilium/tree/master/bpf&#34; target=&#34;_blank&#34;&gt;Cilium&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Ship eBPF code and pull in the kernel headers on the target platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In both of these cases, the eBPF code is still compiled on the target platform, which is an extra step that needs to be performed before the user-space application can start. However, there&amp;rsquo;s an alternative, which is to pre-compile the eBPF code and only ship the ELF files. This is exactly what can be done with &lt;a href=&#34;https://pkg.go.dev/github.com/cilium/ebpf/cmd/bpf2go&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;bpf2go&lt;/code&gt;&lt;/a&gt;, which can embed the compiled code into a Go package. It relies on &lt;code&gt;go generate&lt;/code&gt; to produce a &lt;a href=&#34;https://github.com/networkop/xdp-xconnect/blob/main/pkg/xdp/xdp_bpf.go&#34; target=&#34;_blank&#34;&gt;new file&lt;/a&gt; with compiled eBPF and &lt;code&gt;libbpf-go&lt;/code&gt; skeleton code, the only requirement being the &lt;a href=&#34;https://github.com/networkop/xdp-xconnect/blob/main/pkg/xdp/xdp.go#L14&#34; target=&#34;_blank&#34;&gt;&lt;code&gt;//go:generate&lt;/code&gt;&lt;/a&gt; instruction. Once generated, though, our eBPF program can be loaded with just a few lines (note the absence of any arguments):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;specs, err := newXdpSpecs()
objs, err := specs.Load(nil)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The obvious benefit of this approach is that we no longer need to compile on the target machine and can ship both eBPF and userspace Go code in a single package or Go binary. This is great because it allows us to use our application not only as a binary but also import it into any 3rd party Go applications (see &lt;a href=&#34;https://github.com/networkop/xdp-xconnect#usage&#34; target=&#34;_blank&#34;&gt;usage example&lt;/a&gt;).&lt;/p&gt;
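&lt;p&gt;For illustration, the &lt;code&gt;go:generate&lt;/code&gt; directive driving this process looks roughly like the following (the identifier and paths are assumptions, not copied from the repository):&lt;/p&gt;

```go
// Package xdp wraps the compiled eBPF program.
package xdp

// Running `go generate` invokes bpf2go, which compiles xconnect.c with clang
// and emits a Go file containing the embedded ELF bytes plus the
// libbpf-go loader skeleton (newXdpSpecs and friends).
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go xdp ebpf/xconnect.c
```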

&lt;h2 id=&#34;reading-and-interesting-references&#34;&gt;Reading and Interesting References&lt;/h2&gt;

&lt;p&gt;Generic Theory:&lt;br /&gt;
&lt;a href=&#34;https://github.com/xdp-project/xdp-tutorial&#34; target=&#34;_blank&#34;&gt;https://github.com/xdp-project/xdp-tutorial&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://docs.cilium.io/en/stable/bpf/&#34; target=&#34;_blank&#34;&gt;https://docs.cilium.io/en/stable/bpf/&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/&#34; target=&#34;_blank&#34;&gt;https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;BCC and libbpf:&lt;br /&gt;
&lt;a href=&#34;https://facebookmicrosites.github.io/bpf/blog/2020/02/20/bcc-to-libbpf-howto-guide.html&#34; target=&#34;_blank&#34;&gt;https://facebookmicrosites.github.io/bpf/blog/2020/02/20/bcc-to-libbpf-howto-guide.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://nakryiko.com/posts/libbpf-bootstrap/&#34; target=&#34;_blank&#34;&gt;https://nakryiko.com/posts/libbpf-bootstrap/&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://pingcap.com/blog/why-we-switched-from-bcc-to-libbpf-for-linux-bpf-performance-analysis&#34; target=&#34;_blank&#34;&gt;https://pingcap.com/blog/why-we-switched-from-bcc-to-libbpf-for-linux-bpf-performance-analysis&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://facebookmicrosites.github.io/bpf/blog/&#34; target=&#34;_blank&#34;&gt;https://facebookmicrosites.github.io/bpf/blog/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;eBPF/XDP performance:&lt;br /&gt;
&lt;a href=&#34;https://www.netronome.com/blog/bpf-ebpf-xdp-and-bpfilter-what-are-these-things-and-what-do-they-mean-enterprise/&#34; target=&#34;_blank&#34;&gt;https://www.netronome.com/blog/bpf-ebpf-xdp-and-bpfilter-what-are-these-things-and-what-do-they-mean-enterprise/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Linux Kernel Coding Style:&lt;br /&gt;
&lt;a href=&#34;https://www.kernel.org/doc/html/v5.9/process/coding-style.html&#34; target=&#34;_blank&#34;&gt;https://www.kernel.org/doc/html/v5.9/process/coding-style.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;libbpf-go&lt;/code&gt; example programs:&lt;br /&gt;
&lt;a href=&#34;https://github.com/takehaya/goxdp-template&#34; target=&#34;_blank&#34;&gt;https://github.com/takehaya/goxdp-template&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://github.com/hrntknr/nfNat&#34; target=&#34;_blank&#34;&gt;https://github.com/hrntknr/nfNat&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://github.com/takehaya/Vinbero&#34; target=&#34;_blank&#34;&gt;https://github.com/takehaya/Vinbero&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://github.com/tcfw/vpc&#34; target=&#34;_blank&#34;&gt;https://github.com/tcfw/vpc&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://github.com/florianl/tc-skeleton&#34; target=&#34;_blank&#34;&gt;https://github.com/florianl/tc-skeleton&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://github.com/cloudflare/rakelimit&#34; target=&#34;_blank&#34;&gt;https://github.com/cloudflare/rakelimit&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://github.com/b3a-dev/ebpf-geoip-demo&#34; target=&#34;_blank&#34;&gt;https://github.com/b3a-dev/ebpf-geoip-demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bpf2go&lt;/code&gt;:&lt;br /&gt;
&lt;a href=&#34;https://github.com/lmb/ship-bpf-with-go&#34; target=&#34;_blank&#34;&gt;https://github.com/lmb/ship-bpf-with-go&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://pkg.go.dev/github.com/cilium/ebpf/cmd/bpf2go&#34; target=&#34;_blank&#34;&gt;https://pkg.go.dev/github.com/cilium/ebpf/cmd/bpf2go&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;XDP example programs:&lt;br /&gt;
&lt;a href=&#34;https://github.com/cpmarvin/lnetd-ctl&#34; target=&#34;_blank&#34;&gt;https://github.com/cpmarvin/lnetd-ctl&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://gitlab.com/mwiget/crpd-l2tpv3-xdp&#34; target=&#34;_blank&#34;&gt;https://gitlab.com/mwiget/crpd-l2tpv3-xdp&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Building your own SD-WAN with Envoy and Wireguard</title>
      <link>https://networkop.co.uk/post/2021-02-diy-sdwan/</link>
      <pubDate>Sat, 13 Feb 2021 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2021-02-diy-sdwan/</guid>
      <description>

&lt;p&gt;When using a personal VPN at home, one of the biggest problems I&amp;rsquo;ve faced was the inability to access public streaming services. I don&amp;rsquo;t care about watching Netflix from another country, I just want to be able to use my local internet connection for this kind of traffic while still encrypting everything else. This problem is commonly known in network engineering as &amp;ldquo;local internet breakout&amp;rdquo; and is often implemented at remote branch/edge sites to save costs of transporting SaaS traffic (e.g. Office365) over the VPN infrastructure. These &amp;ldquo;local breakout&amp;rdquo; solutions often rely on &lt;a href=&#34;https://sdwan-docs.cisco.com/Product_Documentation/Software_Features/SD-WAN_Release_16.2/07Policy_Applications/04Using_a_vEdge_Router_as_a_NAT_Device/Configuring_Local_Internet_Exit&#34; target=&#34;_blank&#34;&gt;explicit enumeration&lt;/a&gt; of all public IP subnets, which is a bit &lt;a href=&#34;https://docs.microsoft.com/en-gb/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide&#34; target=&#34;_blank&#34;&gt;cumbersome&lt;/a&gt;, or require &amp;ldquo;intelligent&amp;rdquo; (i.e. expensive) &lt;a href=&#34;https://www.silver-peak.com/products/unity-edge-connect/first-packet-iq&#34; target=&#34;_blank&#34;&gt;DPI&lt;/a&gt; functionality. However, it is absolutely possible to build something like this for personal use and this post will demonstrate how to do that.&lt;/p&gt;

&lt;h2 id=&#34;solution-overview&#34;&gt;Solution Overview&lt;/h2&gt;

&lt;p&gt;The problem scope consists of two relatively independent areas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Traffic routing&lt;/strong&gt; - how to forward traffic to different outgoing interfaces based on the target domain.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;VPN management&lt;/strong&gt; - how to connect to the best VPN gateway and make sure that connection stays healthy.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these problem areas is addressed by a separate set of components.&lt;/p&gt;

&lt;p&gt;VPN management is solved by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;smart-vpn-client&lt;/strong&gt; agent that discovers all of the available VPN gateways, connects to the closest one and continuously monitors the state of that connection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traffic routing is solved by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A transparent proxy (&lt;strong&gt;Envoy&lt;/strong&gt;), capable of domain- and SNI-based routing and binding to multiple outgoing interfaces.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;A proxy controller called &lt;strong&gt;envoy-split-proxy&lt;/strong&gt;, that monitors the user intent (what traffic to route where) and ensures that Envoy configuration is updated accordingly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An extra bonus is a free-tier monitoring solution based on &lt;a href=&#34;https://grafana.com/products/cloud/pricing/&#34; target=&#34;_blank&#34;&gt;Grafana Cloud&lt;/a&gt; that scrapes local metrics and pushes them to the managed observability platform.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/sd-wan.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Below, I&amp;rsquo;ll walk through the component design and the steps to deploy this solution on a Linux-based ARM64 box (in my case it&amp;rsquo;s a Synology NAS). The only two prerequisites that are not covered in this blogpost are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker support on the target ARM64 box (see &lt;a href=&#34;https://github.com/markdumay/synology-docker&#34; target=&#34;_blank&#34;&gt;this guide&lt;/a&gt; for Synology)&lt;/li&gt;
&lt;li&gt;Wireguard kernel module loaded on the target ARM64 box (see &lt;a href=&#34;https://github.com/runfalk/synology-wireguard&#34; target=&#34;_blank&#34;&gt;this guide&lt;/a&gt;  for Synology)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&#34;smart-vpn-client&#34;&gt;Smart VPN Client&lt;/h2&gt;

&lt;p&gt;At its core, the &lt;a href=&#34;https://github.com/networkop/smart-vpn-client&#34; target=&#34;_blank&#34;&gt;smart-vpn-client&lt;/a&gt; implements a standard set of functions you&amp;rsquo;d expect from a VPN client, i.e.:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discovers all of the available VPN gateways (exit nodes) it can connect to.&lt;/li&gt;
&lt;li&gt;Measures the latency and selects the &amp;ldquo;closest&amp;rdquo; gateway for higher &lt;a href=&#34;https://en.wikipedia.org/wiki/Bandwidth-delay_product&#34; target=&#34;_blank&#34;&gt;throughput&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Configures the wireguard interface and associated &lt;a href=&#34;https://www.wireguard.com/netns/#routing-all-your-traffic&#34; target=&#34;_blank&#34;&gt;routing policies&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only supported VPN provider at this stage is PIA, so the discovery and VPN setup is based on the instructions from the &lt;a href=&#34;https://github.com/pia-foss/manual-connections&#34; target=&#34;_blank&#34;&gt;pia-foss repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &amp;ldquo;smart&amp;rdquo; functionality is designed to maintain a consistent user experience in the presence of network congestion and VPN gateway overload, and it does that by resetting the VPN connection if it becomes too slow or unresponsive. Translated into technical terms, this is implemented as the following sequence of steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a new VPN connection is set up, we record the &amp;ldquo;baseline&amp;rdquo; round-trip time over it.&lt;/li&gt;
&lt;li&gt;Connection health monitor periodically measures the RTT and maintains a record of the last 10 values.&lt;/li&gt;
&lt;li&gt;At the end of each measurement, connection health is evaluated and can be deemed degraded if either:

&lt;ul&gt;
&lt;li&gt;No response was received within a timeout window of 10s.&lt;/li&gt;
&lt;li&gt;The exponentially weighted average of the last 10 measurements exceeded 10x the &amp;ldquo;baseline&amp;rdquo;.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;If health stays degraded for 3 consecutive measurement intervals, the VPN connection is re-established to the new &amp;ldquo;closest&amp;rdquo; VPN gateway.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The VPN client binary can be built from &lt;a href=&#34;https://github.com/networkop/smart-vpn-client&#34; target=&#34;_blank&#34;&gt;source&lt;/a&gt; or downloaded as a docker image, which is how I&amp;rsquo;m deploying it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;#!/bin/sh
docker pull networkop/smart-vpn-client

docker rm -f vpn
docker run --privileged networkop/smart-vpn-client -cleanup
docker run -d --name vpn --restart always --net host \
--env VPN_PWD=&amp;lt;VPN-PASSWORD&amp;gt; \
--privileged \
networkop/smart-vpn-client \
-user &amp;lt;VPN-USER&amp;gt; -ignore=uk_2
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The above script creates a new container attached to the root network namespace. We can see the main steps it went through in the logs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ docker logs vpn
level=info msg=&amp;quot;Starting VPN Connector&amp;quot;
level=info msg=&amp;quot;Ignored headends: [uk_2]&amp;quot;
level=info msg=&amp;quot;VPN provider is PIA&amp;quot;
level=info msg=&amp;quot;Discovering VPN headends for PIA&amp;quot;
level=info msg=&amp;quot;Winner is uk with latency 14 ms&amp;quot;
level=info msg=&amp;quot;Brining up WG tunnel to 143.X.X.X:1337&amp;quot;
level=info msg=&amp;quot;Wireguard Tunnel is UP&amp;quot;
level=info msg=&amp;quot;New baseline is 202 ms; Threshold is 2020&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we can verify that the wireguard tunnel has been set up:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo wg show
interface: wg-pia
  public key: MY_PUBLIC_KEY
  private key: (hidden)
  listening port: 34006
  fwmark: 0xea55

peer: PEER_PUBLIC_KEY
  endpoint: 143.X.X.X:1337
  allowed ips: 0.0.0.0/0
  latest handshake: 1 minute, 21 seconds ago
  transfer: 3.29 GiB received, 1.03 GiB sent
  persistent keepalive: every 15 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;envoy-split-proxy&#34;&gt;Envoy Split Proxy&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Split_tunneling&#34; target=&#34;_blank&#34;&gt;Split tunneling&lt;/a&gt; is a technique commonly used in VPN access to enable local internet breakout for some subset of user traffic. It works at Layer 3, so the decision is made based on the contents of a local routing table. What I&amp;rsquo;ve done with Envoy is effectively take the same idea and extend it to L4-L7, hence the name &lt;strong&gt;split proxy&lt;/strong&gt;. The goal was to make L4-L7 split-routing completely transparent to the end-user, with no extra requirements (e.g. no custom proxy configuration) apart from a default route pointing at the ARM64 box. This goal is achieved by a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Envoy proxy acting as a configurable dataplane for L4-L7 traffic.&lt;/li&gt;
&lt;li&gt;IPTables redirecting all inbound TCP/80 and TCP/443 traffic to envoy listeners.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol&#34; target=&#34;_blank&#34;&gt;XDS&lt;/a&gt; controller that configures envoy to act as a transparent forward proxy based on the user intent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The user intent is expressed as a YAML file with the list of domains and the non-default interface to bind to when making outgoing requests. This file is watched by the envoy-split-proxy application and applied to envoy on every detected change.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;interface: eth0
urls:
## Netflix
- netflix.com
- &amp;quot;*.nflxso.net&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;All other domains will be proxied and sent out the default (wireguard) interface, so the above file only defines the exceptions. One obvious problem is that streaming services will most likely use a combination of domains, not just their well-known second-level domains. The domain &lt;a href=&#34;https://github.com/networkop/envoy-split-proxy#discovering-domain-names&#34; target=&#34;_blank&#34;&gt;discovery process&lt;/a&gt; may be a bit tedious but only needs to be done once for a single streaming service. Some of the domains that I use are already &lt;a href=&#34;https://github.com/networkop/envoy-split-proxy/blob/main/split.yaml&#34; target=&#34;_blank&#34;&gt;documented&lt;/a&gt; in the source code repository.&lt;/p&gt;

&lt;p&gt;Similar to the VPN client, all software can be deployed directly on ARM64 box as binaries, or as docker containers. Regardless of the deployment method, the two prerequisites are the user intent YAML file and the Envoy bootstrap configuration that makes it connect to the XDS controller.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ curl -O https://raw.githubusercontent.com/networkop/envoy-split-proxy/main/envoy.yaml
$ curl -O https://raw.githubusercontent.com/networkop/envoy-split-proxy/main/split.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With those files in the &lt;code&gt;pwd&lt;/code&gt; we can spin up the two docker containers with the following script:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;#!/bin/sh

docker pull networkop/envoy-split-proxy
docker pull envoyproxy/envoy:v1.16.2

docker rm -f app
docker rm -f envoy

docker run -d --name app --restart always --net host \
-v $(pwd)/split.yaml:/split.yaml \
networkop/envoy-split-proxy \
-conf /split.yaml

docker run -d --name envoy --restart always --net host \
-v $(pwd)/envoy.yaml:/etc/envoy/envoy.yaml \
envoyproxy/envoy:v1.16.2 \
--config-path /etc/envoy/envoy.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally, all transit traffic needs to get redirected to envoy with a couple of iptables rules:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;#!/bin/sh
# delete any stale rules first so the script can be re-run safely
sudo iptables -t nat -D PREROUTING -p tcp --dport 443 -j REDIRECT --to-port 10000
sudo iptables -t nat -D PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 10001

sudo iptables -t nat -A PREROUTING -p tcp --dport 443 -j REDIRECT --to-port 10000
sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 10001
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;monitoring&#34;&gt;Monitoring&lt;/h2&gt;

&lt;p&gt;Observability is a critical part of any &amp;ldquo;software-defined&amp;rdquo; networking product, so our solution shouldn&amp;rsquo;t be an exception. It&amp;rsquo;s &lt;a href=&#34;https://nleiva.medium.com/monitoring-your-home-lab-devices-in-the-cloud-for-free-54c4d11ac471&#34; target=&#34;_blank&#34;&gt;even easier&lt;/a&gt; when we don&amp;rsquo;t have to manage it ourselves. Thanks to Grafana Cloud&amp;rsquo;s &lt;a href=&#34;https://grafana.com/blog/2021/01/12/the-new-grafana-cloud-the-only-composable-observability-stack-for-metrics-logs-and-traces-now-with-free-and-paid-plans-to-suit-every-use-case/&#34; target=&#34;_blank&#34;&gt;forever free plan&lt;/a&gt;, all we have to do is deploy a grafana agent and scrape metrics exposed by envoy and smart-vpn-client. In order to save on resource utilisation (both local and cloud), I&amp;rsquo;ve disabled some of the less interesting collectors and dropped most of the envoy metrics, so that the final configuration file looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;integrations:
  node_exporter:
    enabled: true
    disable_collectors:
      - bonding
      - infiniband
      - ipvs
      - mdadm
      - nfs
      - nfsd
      - xfs
      - zfs
      - arp
      - btrfs
      - bcache
      - edac
      - entropy
      - pressure
      - rapl
      - softnet
  prometheus_remote_write:
    - basic_auth:
        password: &amp;lt;PWD&amp;gt;
        username: &amp;lt;USERNAME&amp;gt;
      url: https://prometheus.grafana.net/api/prom/push
prometheus:
  configs:
    - name: integrations
      remote_write:
        - basic_auth:
            password: &amp;lt;PWD&amp;gt;
            username: &amp;lt;USERNAME&amp;gt;
          url: https://prometheus.grafana.net/api/prom/push
      scrape_configs:
      - job_name: vpn
        scrape_interval: 5s
        static_configs:
        - targets: [&#39;localhost:2112&#39;]
      - job_name: envoy
        metrics_path: /stats/prometheus
        metric_relabel_configs:
        - source_labels: [__name__]
          regex: &amp;quot;.+_ms_bucket&amp;quot;
          action: keep
        - source_labels: [envoy_cluster_name]
          regex: &amp;quot;xds_cluster&amp;quot;
          action: drop
        static_configs:
        - targets: [&#39;localhost:19000&#39;]
  global:
    scrape_interval: 15s
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The script to enable grafana agent simply mounts the above configuration file and points the agent at it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;#!/bin/sh

docker rm -f agent
docker run -d --name agent --restart always --net host \
-v /tmp/grafana-agent-wal:/etc/agent \
-v $(pwd)/config.yaml:/etc/agent-config/agent.yaml \
grafana/agent:v0.12.0 --config.file=/etc/agent-config/agent.yaml --prometheus.wal-directory=/etc/agent/data
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The collected metrics can be displayed in a beautiful dashboard allowing us to correlate network throughput, VPN healthchecks and proxy connection latencies.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/sdwan-dashboard.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;h2 id=&#34;credits&#34;&gt;Credits&lt;/h2&gt;

&lt;p&gt;Building something like this would have been a lot more difficult without other FOSS projects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/envoyproxy/envoy&#34; target=&#34;_blank&#34;&gt;Envoy&lt;/a&gt; proxy - the most versatile and feature-rich proxy in the world today.&lt;/li&gt;
&lt;li&gt;Wireguard and &lt;a href=&#34;https://github.com/WireGuard/wgctrl-go&#34; target=&#34;_blank&#34;&gt;wgctrl&lt;/a&gt; Go package to manage all interface-related configurations.&lt;/li&gt;
&lt;li&gt;Grafana Cloud with its &lt;a href=&#34;https://grafana.com/products/cloud/pricing/&#34; target=&#34;_blank&#34;&gt;free tier plan&lt;/a&gt;, which is a perfect fit for personal/home use.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Self-hosted external DNS resolver for Kubernetes</title>
      <link>https://networkop.co.uk/post/2020-08-k8s-gateway/</link>
      <pubDate>Fri, 14 Aug 2020 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2020-08-k8s-gateway/</guid>
      <description>

&lt;p&gt;There comes a time in the life of every Kubernetes cluster when internal resources (pods, deployments) need to be exposed to the outside world. Doing so from a pure IP connectivity perspective is relatively easy as most of the constructs come baked-in (e.g. NodePort-type Services) or can be enabled with an off-the-shelf add-on (e.g. Ingress and LoadBalancer controllers). In this post, we&amp;rsquo;ll focus on one crucial piece of network connectivity, which glues together the dynamically-allocated external IP with a static customer-defined hostname — DNS. We&amp;rsquo;ll examine the pros and cons of various ways of implementing external DNS in Kubernetes and introduce a new CoreDNS plugin that can be used for dynamic discovery and resolution of multiple types of external Kubernetes resources.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/d11.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;h2 id=&#34;external-kubernetes-resources&#34;&gt;External Kubernetes Resources&lt;/h2&gt;

&lt;p&gt;Let&amp;rsquo;s start by reviewing various types of &amp;ldquo;external&amp;rdquo; Kubernetes resources and the level of networking abstraction they provide, starting from the lowest all the way to the highest level.&lt;/p&gt;

&lt;p&gt;One of the most fundamental building blocks of all things external in Kubernetes is the &lt;strong&gt;&lt;a href=&#34;https://kubernetes.io/docs/concepts/services-networking/service/#nodeport&#34; target=&#34;_blank&#34;&gt;NodePort&lt;/a&gt;&lt;/strong&gt; service. It works by allocating a unique external port for every service instance and setting up kube-proxy to deliver incoming packets from that port to one of the healthy backend pods. This service is rarely used on its own and was designed to be a building block for other higher-level resources.&lt;/p&gt;
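&lt;p&gt;As a quick illustration, a minimal NodePort Service manifest might look like this (the names and ports are arbitrary):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: NodePort
  selector:
    app: web           # pods backing this service
  ports:
  - port: 80           # ClusterIP port
    targetPort: 8080   # container port
    nodePort: 30080    # externally reachable port (default range 30000-32767)
```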

&lt;p&gt;Next level up is the &lt;a href=&#34;https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer&#34; target=&#34;_blank&#34;&gt;&lt;strong&gt;LoadBalancer&lt;/strong&gt;&lt;/a&gt; service which is one of the most common ways of exposing services externally. This service type requires an extra controller that will be responsible for IP address allocation and delivering traffic to the Kubernetes nodes. This function can be implemented by cloud load-balancers, in case the cluster is deployed in one of the public clouds, by a physical appliance or by a cluster add-on like &lt;a href=&#34;https://github.com/metallb/metallb&#34; target=&#34;_blank&#34;&gt;MetalLB&lt;/a&gt; or &lt;a href=&#34;https://github.com/kubesphere/porter&#34; target=&#34;_blank&#34;&gt;Porter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At the highest level of abstraction is the &lt;a href=&#34;https://kubernetes.io/docs/concepts/services-networking/ingress/&#34; target=&#34;_blank&#34;&gt;&lt;strong&gt;Ingress&lt;/strong&gt;&lt;/a&gt; resource. It, too, requires a dedicated controller which spins up and configures a number of proxy servers that can act as an L7 load-balancer, API gateway or, in some cases, an L4 (TCP/UDP) proxy. Similarly to the LoadBalancer, Ingress may be implemented by one of the public cloud L7 load-balancers or could be self-hosted by the cluster using any one of the &lt;a href=&#34;https://docs.google.com/spreadsheets/d/16bxRgpO1H_Bn-5xVZ1WrR_I-0A-GOI6egmhvqqLMOmg/edit#gid=1612037324&#34; target=&#34;_blank&#34;&gt;open-source ingress controllers&lt;/a&gt;. Amongst other things, Ingress controllers can perform &lt;a href=&#34;https://kubernetes.io/docs/concepts/services-networking/ingress/#tls&#34; target=&#34;_blank&#34;&gt;TLS offloading&lt;/a&gt; and &lt;a href=&#34;https://kubernetes.io/docs/concepts/services-networking/ingress/#name-based-virtual-hosting&#34; target=&#34;_blank&#34;&gt;name-based routing&lt;/a&gt; which rely heavily on external DNS infrastructure that can dynamically discover Ingress resources as they get added/removed from the cluster.&lt;/p&gt;

&lt;p&gt;There are other external-ish resources like &lt;a href=&#34;https://kubernetes.io/docs/concepts/services-networking/service/&#34; target=&#34;_blank&#34;&gt;ExternalName&lt;/a&gt; services and even ClusterIP in &lt;a href=&#34;https://docs.projectcalico.org/networking/advertise-service-ips&#34; target=&#34;_blank&#34;&gt;certain cases&lt;/a&gt;. They represent a very small subset of corner case scenarios and are considered outside of the scope of this article. Instead, we&amp;rsquo;ll focus on the two most widely used external resources—LoadBalancers and Ingresses, and see how they can be integrated into the public DNS infrastructure.&lt;/p&gt;

&lt;h2 id=&#34;externaldns&#34;&gt;ExternalDNS&lt;/h2&gt;

&lt;p&gt;The most popular solution today is the &lt;a href=&#34;https://github.com/kubernetes-sigs/external-dns&#34; target=&#34;_blank&#34;&gt;ExternalDNS controller&lt;/a&gt;. It works by integrating with one of the public DNS providers and populates a pre-configured DNS zone with entries extracted from the monitored objects, e.g. Ingress&amp;rsquo;s &lt;code&gt;spec.rules[*].host&lt;/code&gt; or Service&amp;rsquo;s &lt;code&gt;external-dns.alpha.kubernetes.io/hostname&lt;/code&gt; annotations. In addition, it natively supports non-standard resources like Istio&amp;rsquo;s Gateway or Contour&amp;rsquo;s IngressRoute which, together with the support for over 15 cloud DNS providers, makes it a default choice for anyone approaching this problem for the first time.&lt;/p&gt;
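
&lt;p&gt;For illustration, this is roughly what the annotation-driven workflow looks like for a LoadBalancer Service (a hypothetical sketch; the service name and domain are made up):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: v1
kind: Service
metadata:
  name: svc2
  annotations:
    # ExternalDNS publishes this FQDN using the IP from .status.loadBalancer
    external-dns.alpha.kubernetes.io/hostname: svc2.ns.mydomain.com
spec:
  type: LoadBalancer
  selector:
    app: svc2
  ports:
  - port: 80
&lt;/code&gt;&lt;/pre&gt;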

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/d12.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;ExternalDNS is an ideal solution for Kubernetes clusters under a single administrative domain, however, it does have a number of trade-offs that start to manifest themselves when a cluster is shared among multiple tenants:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple DNS zones require a dedicated ExternalDNS instance per zone.&lt;/li&gt;
&lt;li&gt;Each new zone requires cloud-specific IAM rules to be set up to allow ExternalDNS to make the required changes.&lt;/li&gt;
&lt;li&gt;Unless you&amp;rsquo;re managing a DNS zone hosted by the same cloud provider, API credentials will need to be stored as a secret inside the cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition to the above, ExternalDNS represents another layer of abstraction and complexity outside of the cluster that needs to be considered during maintenance and troubleshooting. Every time the controller fails, there&amp;rsquo;s a possibility of stale state being left behind, accumulating over time and polluting the hosted DNS zone.&lt;/p&gt;

&lt;h2 id=&#34;coredns-s-k8s-external-plugin&#34;&gt;CoreDNS&amp;rsquo;s &lt;code&gt;k8s_external&lt;/code&gt; plugin&lt;/h2&gt;

&lt;p&gt;An alternative approach is to make the internal Kubernetes DNS add-on respond to external DNS queries. The prime example of this is the CoreDNS &lt;a href=&#34;https://coredns.io/plugins/k8s_external/&#34; target=&#34;_blank&#34;&gt;k8s_external&lt;/a&gt; plugin. It works by configuring CoreDNS to respond to external queries matching a number of pre-configured domains. For example, the following configuration will allow it to resolve queries for &lt;code&gt;svc2.ns.mydomain.com&lt;/code&gt;, as shown in the diagram above, as well as the &lt;code&gt;svc2.ns.example.com&lt;/code&gt; domain:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;k8s_external mydomain.com example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Both queries will return the same set of IP addresses extracted from the &lt;code&gt;.status.loadBalancer&lt;/code&gt; field of the &lt;code&gt;svc2&lt;/code&gt; object.&lt;/p&gt;
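
&lt;p&gt;For context, a minimal Corefile sketch with &lt;code&gt;k8s_external&lt;/code&gt; sitting alongside the main &lt;code&gt;kubernetes&lt;/code&gt; plugin could look like this (the zone names are the same assumptions as above):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa
    k8s_external mydomain.com example.com
    forward . /etc/resolv.conf
    cache 30
}
&lt;/code&gt;&lt;/pre&gt;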

&lt;p&gt;These domains will still need to be delegated, which means you will need to expose CoreDNS externally with service type LoadBalancer and update NS records with the provisioned IP address.&lt;/p&gt;
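
&lt;p&gt;One way to do that is a second Service pointing at the CoreDNS pods (a sketch that assumes the conventional &lt;code&gt;k8s-app: kube-dns&lt;/code&gt; labels used by kubeadm); once the external IP is provisioned, the zone&amp;rsquo;s NS record is pointed at it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: v1
kind: Service
metadata:
  name: coredns-external
  namespace: kube-system
spec:
  type: LoadBalancer
  selector:
    k8s-app: kube-dns
  ports:
  - name: dns-udp
    port: 53
    protocol: UDP
  - name: dns-tcp
    port: 53
    protocol: TCP
&lt;/code&gt;&lt;/pre&gt;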

&lt;p&gt;Under the hood, &lt;code&gt;k8s_external&lt;/code&gt; relies on the main &lt;a href=&#34;https://coredns.io/plugins/kubernetes/&#34; target=&#34;_blank&#34;&gt;kubernetes&lt;/a&gt; plugin and simply re-uses information already collected by it. This presents a problem when trying to add extra resources (e.g. Ingresses, Gateways), as these changes would increase the amount of information the main plugin needs to process and inevitably affect its performance. This is why there&amp;rsquo;s now a new plugin, designed to absorb and extend the functionality of &lt;code&gt;k8s_external&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&#34;the-new-k8s-gateway-coredns-plugin&#34;&gt;The new &lt;code&gt;k8s_gateway&lt;/code&gt; CoreDNS plugin&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;https://github.com/ori-edge/k8s_gateway&#34; target=&#34;_blank&#34;&gt;This out-of-tree plugin&lt;/a&gt; is loosely based on &lt;code&gt;k8s_external&lt;/code&gt; and maintains a similar configuration syntax; however, it has a few notable differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It doesn&amp;rsquo;t rely on any other plugin and uses its own mechanism of Kubernetes object discovery.&lt;/li&gt;
&lt;li&gt;It&amp;rsquo;s designed to be used alongside (and not replace) an existing internal DNS plugin, be it kube-dns or CoreDNS.&lt;/li&gt;
&lt;li&gt;It doesn&amp;rsquo;t collect or expose any internal cluster IP addresses.&lt;/li&gt;
&lt;li&gt;It supports both LoadBalancer services and Ingresses with an eye on the service API&amp;rsquo;s &lt;a href=&#34;https://github.com/kubernetes-sigs/service-apis/blob/master/examples/basic-http.yaml#L29&#34; target=&#34;_blank&#34;&gt;HTTPRoute&lt;/a&gt; when it becomes available.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/d13.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;The way it&amp;rsquo;s designed to be used can be summarised as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The scope of the plugin is controlled by a set of RBAC rules and by default is limited to List/Watch operations on Ingress and Service resources.&lt;/li&gt;
&lt;li&gt;The plugin is &lt;a href=&#34;https://github.com/ori-edge/k8s_gateway#build&#34; target=&#34;_blank&#34;&gt;built&lt;/a&gt; as a CoreDNS binary and run as a deployment.&lt;/li&gt;
&lt;li&gt;This deployment is exposed externally and the required domains are delegated to the address of the external load-balancer.&lt;/li&gt;
&lt;li&gt;Any DNS query that reaches the &lt;code&gt;k8s_gateway&lt;/code&gt; plugin will go through the following stages:

&lt;ul&gt;
&lt;li&gt;First, it will be matched against one of the zones configured for this plugin in the Corefile.&lt;/li&gt;
&lt;li&gt;If there&amp;rsquo;s a hit, the next step is to match it against any of the existing Ingress resources. The lookup is performed against FQDNs configured in &lt;code&gt;spec.rules[*].host&lt;/code&gt; fields of the Ingress.&lt;/li&gt;
&lt;li&gt;At this stage, the result can be returned to the user with IPs collected from the &lt;code&gt;.status.loadBalancer.ingress&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If no matching Ingress was found, the search continues with the Service objects. Since services don&amp;rsquo;t really have domain names, the lookup is performed using the &lt;code&gt;serviceName.namespace&lt;/code&gt; as the key.&lt;/li&gt;
&lt;li&gt;If there&amp;rsquo;s a match, it is returned to the end-user in a similar way; otherwise, the plugin responds with &lt;code&gt;NXDOMAIN&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
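
&lt;p&gt;The lookup order above can be sketched in Go (a toy model for illustration only; the real plugin watches live Kubernetes objects rather than static maps):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;package main

import (
	&amp;quot;fmt&amp;quot;
	&amp;quot;strings&amp;quot;
)

// gateway is a toy model of the plugin state.
type gateway struct {
	zones        []string            // zones configured in the Corefile
	ingressHosts map[string][]string // FQDNs from spec.rules[*].host
	services     map[string][]string // keyed by &amp;quot;serviceName.namespace&amp;quot;
}

func (g *gateway) resolve(qname string) ([]string, string) {
	var zone string
	for _, z := range g.zones {
		if strings.HasSuffix(qname, &amp;quot;.&amp;quot;+z) {
			zone = z
			break
		}
	}
	if zone == &amp;quot;&amp;quot; {
		return nil, &amp;quot;REFUSED&amp;quot; // not one of the configured zones
	}
	if ips, ok := g.ingressHosts[qname]; ok {
		return ips, &amp;quot;NOERROR&amp;quot; // matched an Ingress FQDN
	}
	if ips, ok := g.services[strings.TrimSuffix(qname, &amp;quot;.&amp;quot;+zone)]; ok {
		return ips, &amp;quot;NOERROR&amp;quot; // matched a LoadBalancer service
	}
	return nil, &amp;quot;NXDOMAIN&amp;quot;
}

func main() {
	g := &amp;amp;gateway{
		zones:        []string{&amp;quot;mydomain.com&amp;quot;},
		ingressHosts: map[string][]string{&amp;quot;app.mydomain.com&amp;quot;: {&amp;quot;100.64.0.1&amp;quot;}},
		services:     map[string][]string{&amp;quot;svc2.ns&amp;quot;: {&amp;quot;100.64.0.2&amp;quot;}},
	}
	fmt.Println(g.resolve(&amp;quot;svc2.ns.mydomain.com&amp;quot;)) // [100.64.0.2] NOERROR
}
&lt;/code&gt;&lt;/pre&gt;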

&lt;p&gt;The design of the &lt;code&gt;k8s_gateway&lt;/code&gt; plugin attempts to address some of the issues of other solutions described above, but also brings a number of extra advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All external DNS entries and associated state are contained within the Kubernetes cluster while the hosted zone only contains a single NS record.&lt;/li&gt;
&lt;li&gt;You get the power and flexibility of the full suite of CoreDNS&amp;rsquo;s &lt;a href=&#34;https://coredns.io/plugins/&#34; target=&#34;_blank&#34;&gt;internal&lt;/a&gt; and &lt;a href=&#34;https://coredns.io/explugins/&#34; target=&#34;_blank&#34;&gt;external&lt;/a&gt; plugins, e.g. you can use ACL to control which source IPs are (not) allowed to make queries.&lt;/li&gt;
&lt;li&gt;Provisioning that doesn&amp;rsquo;t rely on annotations makes it easier to maintain Kubernetes manifests.&lt;/li&gt;
&lt;li&gt;Separate deployment means that internal DNS resolution is not affected in case external DNS becomes overloaded.&lt;/li&gt;
&lt;li&gt;Since API keys are &lt;strong&gt;not&lt;/strong&gt; stored in the cluster, it makes it easier and safer for new tenants to bring their own domain.&lt;/li&gt;
&lt;li&gt;Federated Kubernetes cluster deployments (e.g. using &lt;a href=&#34;https://github.com/kubernetes-sigs/cluster-api&#34; target=&#34;_blank&#34;&gt;Cluster API&lt;/a&gt;) become easier as there&amp;rsquo;s only a single entrypoint via the management cluster and each workload cluster can get its own self-hosted subdomain.&lt;/li&gt;
&lt;/ul&gt;
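
&lt;p&gt;As an example of the ACL point above, restricting queries to a single prefix takes only a few extra lines in the Corefile (the zone is the same assumption as before and the allowed prefix is made up):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;mydomain.com {
    k8s_gateway mydomain.com
    acl {
        allow net 203.0.113.0/24
        block
    }
}
&lt;/code&gt;&lt;/pre&gt;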

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/d14.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;k8s_gateway&lt;/code&gt; plugin is developed out-of-tree under an open-source license. Community contributions in the form of issues, pull requests and documentation are always &lt;a href=&#34;https://github.com/ori-edge/k8s_gateway&#34; target=&#34;_blank&#34;&gt;welcome&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Anatomy of the &#34;kubernetes.default&#34;</title>
      <link>https://networkop.co.uk/post/2020-06-kubernetes-default/</link>
      <pubDate>Mon, 29 Jun 2020 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2020-06-kubernetes-default/</guid>
      <description>

&lt;p&gt;Every Kubernetes cluster is provisioned with a special service that provides a way for internal applications to talk to the API server. However, unlike the rest of the components that get spun up by default, you won&amp;rsquo;t find the definition of this service in any of the static manifests and this is just one of the many things that make this service unique.&lt;/p&gt;

&lt;h2 id=&#34;the-special-one&#34;&gt;The Special One&lt;/h2&gt;

&lt;p&gt;To make sure we&amp;rsquo;re on the same page, I&amp;rsquo;m talking about this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ kubectl get svc kubernetes -n default
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    &amp;lt;none&amp;gt;        443/TCP   161m
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This service is unique in many ways. First, as you may have noticed, it always occupies the first available IP in the Cluster CIDR, a.k.a. &lt;code&gt;--service-cluster-ip-range&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Second, this service is invincible, i.e. it will always get re-created, even when it&amp;rsquo;s manually removed:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    &amp;lt;none&amp;gt;        443/TCP   118s
$ kubectl delete svc kubernetes
service &amp;quot;kubernetes&amp;quot; deleted
$ kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    &amp;lt;none&amp;gt;        443/TCP   0s
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You may notice that it comes up with the same ClusterIP, regardless of how many services may already exist in the cluster.&lt;/p&gt;

&lt;p&gt;Third, this service does not have any matching pods, however it does have a fully populated &lt;code&gt;Endpoints&lt;/code&gt; object:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ kubectl get pod --selector component=apiserver --all-namespaces
No resources found
$ kubectl get endpoints kubernetes
NAME         ENDPOINTS                                         AGE
kubernetes   172.18.0.2:6443,172.18.0.3:6443,172.18.0.4:6443   4m16s
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This last bit is perhaps the most curious one. How can a service have a list of endpoints when there are no pods that match this service&amp;rsquo;s label selector? This goes against how the service controller normally &lt;a href=&#34;https://kubernetes.io/docs/concepts/services-networking/service/#defining-a-service&#34; target=&#34;_blank&#34;&gt;works&lt;/a&gt;. Note that this behaviour is true even for managed Kubernetes clusters, where the API server is run by the provider (e.g. GKE).&lt;/p&gt;
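
&lt;p&gt;Incidentally, the same selector-less pattern is available to regular users: a Service created without a selector gets no automatically-managed endpoints, so a matching &lt;code&gt;Endpoints&lt;/code&gt; object can be maintained by hand. A generic sketch (the names and IPs are made up; this is not the actual apiserver manifest):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  ports:          # note: no selector
  - port: 443
---
apiVersion: v1
kind: Endpoints
metadata:
  name: my-service   # must match the Service name
subsets:
- addresses:
  - ip: 172.18.0.2
  ports:
  - port: 6443
&lt;/code&gt;&lt;/pre&gt;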

&lt;p&gt;Finally, the IP and Port of this service get injected into every pod as environment variables:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;KUBERNETES_SERVICE_HOST=10.96.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;These values can later be used by k8s controllers to &lt;a href=&#34;https://github.com/kubernetes/client-go/blob/master/tools/clientcmd/client_config.go#L561&#34; target=&#34;_blank&#34;&gt;configure&lt;/a&gt; the client-go&amp;rsquo;s rest interface that is used to establish connectivity to the API server:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;func InClusterConfig() (*Config, error) {

	host := os.Getenv(&amp;quot;KUBERNETES_SERVICE_HOST&amp;quot;)
	port := os.Getenv(&amp;quot;KUBERNETES_SERVICE_PORT&amp;quot;)

	return &amp;amp;Config{
		Host: &amp;quot;https://&amp;quot; + net.JoinHostPort(host, port),
	}, nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;controller-of-controllers&#34;&gt;Controller of controllers&lt;/h2&gt;

&lt;p&gt;To find out who&amp;rsquo;s behind this magical service, we need to look at the code for the k/k&amp;rsquo;s &lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/pkg/master/controller.go&#34; target=&#34;_blank&#34;&gt;master controller&lt;/a&gt;, that is described as the &amp;ldquo;controller manager for the core bootstrap Kubernetes controller loops&amp;rdquo;, meaning it&amp;rsquo;s one of the first controllers that gets spun up by the API server binary. Let&amp;rsquo;s break it down into smaller pieces and see what&amp;rsquo;s going on inside it.&lt;/p&gt;

&lt;p&gt;When the controller is started, it spins up a runner, which is a group of functions that run forever until they receive a stop signal via a channel.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;// Start begins the core controller loops that must exist for bootstrapping
// a cluster.
func (c *Controller) Start() {
  
	c.runner = async.NewRunner(c.RunKubernetesNamespaces, c.RunKubernetesService, repairClusterIPs.RunUntil, repairNodePorts.RunUntil)
	c.runner.Start()
}

&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The most interesting is the second function - &lt;code&gt;RunKubernetesService()&lt;/code&gt;, which is a control loop that constantly updates the default kubernetes service.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;// RunKubernetesService periodically updates the kubernetes service
func (c *Controller) RunKubernetesService(ch chan struct{}) {

	if err := c.UpdateKubernetesService(false); err != nil {
		runtime.HandleError(fmt.Errorf(&amp;quot;unable to sync kubernetes service: %v&amp;quot;, err))
	}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Most of the work is done by the &lt;code&gt;UpdateKubernetesService()&lt;/code&gt;. This function does three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates the &amp;ldquo;default&amp;rdquo; namespace whose name is defined in the &lt;code&gt;metav1.NamespaceDefault&lt;/code&gt; variable.&lt;/li&gt;
&lt;li&gt;Creates/Updates the default kubernetes service.&lt;/li&gt;
&lt;li&gt;Creates/Updates the endpoints resource for this service.&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;// UpdateKubernetesService attempts to update the default Kube service.
func (c *Controller) UpdateKubernetesService(reconcile bool) error {

	if err := createNamespaceIfNeeded(c.NamespaceClient, metav1.NamespaceDefault); err != nil {
		return err
   }

	if err := c.CreateOrUpdateMasterServiceIfNeeded(kubernetesServiceName, c.ServiceIP, servicePorts, serviceType, reconcile); err != nil {
		return err
	}

	if err := c.EndpointReconciler.ReconcileEndpoints(kubernetesServiceName, c.PublicIP, endpointPorts, reconcile); err != nil {
		return err
	}

	return nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally, the &lt;code&gt;CreateOrUpdateMasterServiceIfNeeded()&lt;/code&gt; function is where the default service is being built. You can see the skeleton of this service&amp;rsquo;s object in the below snippet:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;const kubernetesServiceName = &amp;quot;kubernetes&amp;quot;

// CreateOrUpdateMasterServiceIfNeeded will create the specified service if it
// doesn&#39;t already exist.
func (c *Controller) CreateOrUpdateMasterServiceIfNeeded(serviceName string, serviceIP net.IP, servicePorts []corev1.ServicePort, serviceType corev1.ServiceType, reconcile bool) error {

	svc := &amp;amp;corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      serviceName,
			Namespace: metav1.NamespaceDefault,
			Labels:    map[string]string{&amp;quot;provider&amp;quot;: &amp;quot;kubernetes&amp;quot;, &amp;quot;component&amp;quot;: &amp;quot;apiserver&amp;quot;},
		},
		Spec: corev1.ServiceSpec{
			Ports: servicePorts,
			// maintained by this code, not by the pod selector
			Selector:        nil,
			ClusterIP:       serviceIP.String(),
			SessionAffinity: corev1.ServiceAffinityNone,
			Type:            serviceType,
		},
	}

	_, err := c.ServiceClient.Services(metav1.NamespaceDefault).Create(context.TODO(), svc, metav1.CreateOptions{})

	return err
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The code above explains why this service can never be completely removed from the cluster - the master controller loop will always recreate it if it&amp;rsquo;s missing, along with its endpoints object. However, this still doesn&amp;rsquo;t explain how the IP for this service is selected, nor where the endpoint IPs come from. To answer that, we need to take a deeper look at how the API server builds its runtime configuration.&lt;/p&gt;

&lt;h2 id=&#34;always-the-first&#34;&gt;Always the first&lt;/h2&gt;

&lt;p&gt;One of the interesting qualities of the ClusterIP of the &lt;code&gt;kubernetes.default&lt;/code&gt; is that it always (unless manually overridden) occupies the first IP in the Cluster CIDR. The reason is hidden in the &lt;code&gt;ServiceIPRange()&lt;/code&gt; function of the master controller&amp;rsquo;s &lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/pkg/master/services.go&#34; target=&#34;_blank&#34;&gt;service.go&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;
func ServiceIPRange(serviceClusterIPRange net.IPNet) (net.IPNet, net.IP, error) {

	size := integer.Int64Min(utilnet.RangeSize(&amp;amp;serviceClusterIPRange), 1&amp;lt;&amp;lt;16)
	if size &amp;lt; 8 {
		return net.IPNet{}, net.IP{}, fmt.Errorf(&amp;quot;the service cluster IP range must be at least %d IP addresses&amp;quot;, 8)
	}

	// Select the first valid IP from ServiceClusterIPRange to use as the GenericAPIServer service IP.
	apiServerServiceIP, err := utilnet.GetIndexedIP(&amp;amp;serviceClusterIPRange, 1)
	if err != nil {
		return net.IPNet{}, net.IP{}, err
	}

	return serviceClusterIPRange, apiServerServiceIP, nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This function gets &lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/pkg/master/master.go#L292&#34; target=&#34;_blank&#34;&gt;called&lt;/a&gt; when the master controller is started and hard-codes the service IP for the default service to the first IP of the range. Another interesting fact is that it always checks that the cluster IP range is at least a /29, which fits 6 usable addresses in the worst case. This can probably be explained by the fact that the next size down, a /30, wouldn&amp;rsquo;t leave much room for user-defined ClusterIPs once &lt;code&gt;kubernetes.default&lt;/code&gt; and &lt;code&gt;kube-dns.kube-system&lt;/code&gt; are configured, whereas with a /29 even the smallest possible cluster can fit a few non-default services before running out of IPs.&lt;/p&gt;
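
&lt;p&gt;To see the arithmetic behind &lt;code&gt;GetIndexedIP&lt;/code&gt;, here&amp;rsquo;s a simplified re-implementation (an illustrative sketch, not the real &lt;code&gt;utilnet&lt;/code&gt; helper, and IPv4-only):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;package main

import (
	&amp;quot;fmt&amp;quot;
	&amp;quot;math/big&amp;quot;
	&amp;quot;net&amp;quot;
)

// indexedIP returns the index-th IP of an IPv4 CIDR by treating the
// network address as a big integer and adding the index to it.
func indexedIP(cidr string, index int64) net.IP {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		panic(err)
	}
	base := big.NewInt(0).SetBytes(ipnet.IP.To4())
	return net.IP(base.Add(base, big.NewInt(index)).Bytes())
}

func main() {
	// the default kind service CIDR; index 1 is the kubernetes.default ClusterIP
	fmt.Println(indexedIP(&amp;quot;10.96.0.0/12&amp;quot;, 1)) // 10.96.0.1
}
&lt;/code&gt;&lt;/pre&gt;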

&lt;h2 id=&#34;endpoint-ips&#34;&gt;Endpoint IPs&lt;/h2&gt;

&lt;p&gt;The way endpoint addresses are populated is different between managed (GKE, AKS, EKS) and non-managed clusters. Let&amp;rsquo;s first have a look at a highly-available kind cluster:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ kubectl describe svc kubernetes | grep Endpoints
Endpoints:         172.18.0.3:6443,172.18.0.4:6443,172.18.0.7:6443
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Bearing in mind that by default kind would use &lt;code&gt;10.244.0.0/16&lt;/code&gt; as the pod IP range and &lt;code&gt;10.96.0.0/12&lt;/code&gt; as the cluster IP range, these IPs don&amp;rsquo;t make a lot of sense. However, since kind uses kubeadm under the hood, which spins up control plane components as static pods, we can find API server pods in the &lt;code&gt;kube-system&lt;/code&gt; namespace:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl -n kube-system get pod -l tier=control-plane -o wide | grep api
kube-apiserver-kind-control-plane             1/1     Running   172.18.0.3
kube-apiserver-kind-control-plane2            1/1     Running   172.18.0.4
kube-apiserver-kind-control-plane3            1/1     Running   172.18.0.7
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If we check the manifest of any of the above pods, we&amp;rsquo;ll see that they are run with &lt;code&gt;hostNetwork: true&lt;/code&gt; and those IPs come from the underlying containers that kind uses as nodes. As a part of the &lt;code&gt;UpdateKubernetesService()&lt;/code&gt; mentioned above, each API server in the cluster goes and &lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/pkg/master/controller.go#L243&#34; target=&#34;_blank&#34;&gt;updates&lt;/a&gt; the &lt;code&gt;endpoints&lt;/code&gt; object with its own IP and port as defined in the &lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/pkg/master/reconcilers/mastercount.go#L62&#34; target=&#34;_blank&#34;&gt;mastercount.go&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;func (r *masterCountEndpointReconciler) ReconcileEndpoints(serviceName string, ip net.IP, endpointPorts []corev1.EndpointPort, reconcilePorts bool) error {

	e.Subsets = []corev1.EndpointSubset{{
		Addresses: []corev1.EndpointAddress{{IP: ip.String()}},
		Ports:     endpointPorts,
	}}
	klog.Warningf(&amp;quot;Resetting endpoints for master service %q to %#v&amp;quot;, serviceName, e)
	_, err = r.epAdapter.Update(metav1.NamespaceDefault, e)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;hr /&gt;

&lt;p&gt;With managed Kubernetes clusters, control-plane nodes are not accessible to end users, so it&amp;rsquo;s harder to say exactly how the endpoints get populated. However, it&amp;rsquo;s fairly easy to imagine that a cloud provider spins up a 3-node control plane behind a load-balancer and configures all three API servers with this LB&amp;rsquo;s IP as the &lt;code&gt;advertise-address&lt;/code&gt;. This would result in a single endpoint that represents the managed control-plane load-balancer:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ kubectl get ep kubernetes
NAME         ENDPOINTS          AGE
kubernetes   172.16.0.2:443   40d
&lt;/code&gt;&lt;/pre&gt;
</description>
    </item>
    
    <item>
      <title>Solving the Ingress Mystery Puzzle</title>
      <link>https://networkop.co.uk/post/2020-06-ingress-puzzle/</link>
      <pubDate>Sat, 13 Jun 2020 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2020-06-ingress-puzzle/</guid>
      <description>

&lt;p&gt;Last week I posted a &lt;a href=&#34;https://twitter.com/networkop1/status/1269651463690760193&#34; target=&#34;_blank&#34;&gt;tweet&lt;/a&gt; about a Kubernetes networking puzzle. In this post, we&amp;rsquo;ll go over the details of this puzzle and uncover the true cause and motive of the misbehaving ingress.&lt;/p&gt;

&lt;h2 id=&#34;puzzle-recap&#34;&gt;Puzzle recap&lt;/h2&gt;

&lt;p&gt;Imagine you have a Kubernetes cluster with three namespaces, each with its own namespace-scoped ingress controller. You&amp;rsquo;ve created an ingress in each namespace that exposes a simple web application. You&amp;rsquo;ve checked one of them, made sure it works and moved on to other things. However some time later, you get reports that the web app is unavailable. You go to check it again and indeed, the page is not responding, although nothing has changed in the cluster. In fact, you realise that the problem is intermittent - one minute you can access the page, and on the next refresh it&amp;rsquo;s gone. To make things worse, you realise that similar issues affect the other two ingresses.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/ingress-puzzle.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;If you feel like you&amp;rsquo;re capable of solving it on your own, feel free to follow the steps in the &lt;a href=&#34;https://github.com/networkop/ingress-puzzle&#34; target=&#34;_blank&#34;&gt;walkthrough&lt;/a&gt;, otherwise, continue reading. In either case, make sure you&amp;rsquo;ve set up a local test environment so that it&amp;rsquo;s easier to follow along:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Clone the ingress-puzzle repo:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;git clone https://github.com/networkop/ingress-puzzle &amp;amp;&amp;amp; cd ingress-puzzle
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Build a local test cluster:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;make cluster
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Create three namespaces:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;make namespaces
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Create an in-cluster load-balancer (MetalLB) that will allocate IPs from a &lt;code&gt;100.64.0.0/16&lt;/code&gt; range:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;make load-balancer
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;In each namespace, install a namespace-scoped ingress controller:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;make controllers
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Create three test deployments and expose them via ingresses:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;make ingresses
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;ingress-controller-expected-behaviour&#34;&gt;Ingress controller expected behaviour&lt;/h2&gt;

&lt;p&gt;In order to solve this puzzle, we need to understand how ingress controllers perform their duties, so let&amp;rsquo;s review how a typical one works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An ingress controller consists of &lt;strong&gt;two components&lt;/strong&gt; - control plane and data plane, which can be run separately or be a part of the same pod/deployment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Control plane&lt;/strong&gt; is a k8s controller that uses its pod&amp;rsquo;s service account to talk to the API server and establishes &amp;ldquo;watches&amp;rdquo; on &lt;code&gt;Ingress&lt;/code&gt;-type resources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data plane&lt;/strong&gt; is a reverse proxy (e.g. nginx, envoy) that receives traffic from end users and forwards it upstream to one of the backend k8s services.&lt;/li&gt;
&lt;li&gt;In order to steer the traffic to the data plane, an external &lt;strong&gt;load-balancer&lt;/strong&gt; service is required, whose address (IP or hostname) is reflected in ingress&amp;rsquo;s status field.&lt;/li&gt;
&lt;li&gt;As &lt;code&gt;Ingress&lt;/code&gt; resources get created/deleted, the controller updates the configuration of its data plane to match the desired state described in those resources.&lt;/li&gt;
&lt;/ol&gt;
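
&lt;p&gt;Putting points 4 and 5 together, this is roughly what a reconciled Ingress looks like, with the &lt;code&gt;status&lt;/code&gt; field populated by the controller (the backend name is hypothetical; the address matches the MetalLB range used below):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: test
spec:
  backend:
    serviceName: web
    servicePort: 80
status:
  loadBalancer:
    ingress:
    - ip: 100.64.0.0   # set by the controller, shown in the ADDRESS column
&lt;/code&gt;&lt;/pre&gt;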

&lt;p&gt;This sounds simple enough, but as always, the devil is in the details, so let&amp;rsquo;s start by focusing on one of the namespaces and observe the behaviour of its ingress.&lt;/p&gt;

&lt;h2 id=&#34;exhibit-1-namespace-one&#34;&gt;Exhibit #1 - namespace one&lt;/h2&gt;

&lt;p&gt;Let&amp;rsquo;s look at the ingress in namespace &lt;code&gt;one&lt;/code&gt;. The output looks healthy: the address is set to &lt;code&gt;100.64.0.0&lt;/code&gt;, which is part of the MetalLB range.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ kubens one
$ kubectl get ingress
NAME   CLASS    HOSTS   ADDRESS      PORTS   AGE
test   &amp;lt;none&amp;gt;   *       100.64.0.0   80      141m
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you want to test connectivity to the backend deployment, you can add the MetalLB public IP range to the docker bridge like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ip=$(kubectl get nodes -o jsonpath=&#39;{.items[0].status.addresses[0].address}&#39;)
device=$(ip -j route get $ip | jq -r &#39;.[0].dev&#39;)
sudo ip addr add 100.64.0.100/16 dev $device
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now you should be able to hit the test nginx deployment:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;curl -s 100.64.0.0 | grep Welcome
&amp;lt;title&amp;gt;Welcome to nginx!&amp;lt;/title&amp;gt;
&amp;lt;h1&amp;gt;Welcome to nginx!&amp;lt;/h1&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Nothing unusual so far, and nothing to indicate intermittent connectivity either. Let&amp;rsquo;s move on.&lt;/p&gt;

&lt;h2 id=&#34;exhibit-2-namespaces-two-three&#34;&gt;Exhibit #2 - namespaces two &amp;amp; three&lt;/h2&gt;

&lt;p&gt;This output looks a bit weird: the IP in the address field is definitely not part of the MetalLB range:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ kubens two
$ kubectl get ingress
NAME   CLASS    HOSTS   ADDRESS      PORTS   AGE
test   &amp;lt;none&amp;gt;   *       172.18.0.2   80      155m
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A similar situation can be observed in the other namespace:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ kubens three
$ kubectl get ingress
NAME   CLASS    HOSTS   ADDRESS      PORTS   AGE
test   &amp;lt;none&amp;gt;   *       172.18.0.2   80      155m
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At this point, these outputs don&amp;rsquo;t make a lot of sense. How can two different ingresses, controlled by two distinct controllers, have the same address? And why are they allocated a private IP that is not managed by MetalLB? If we check services across all existing namespaces, there won&amp;rsquo;t be a single service with an IP from the &lt;code&gt;172.16.0.0/12&lt;/code&gt; range.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl get svc -A | grep 172
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;exhibit-4-flapping-addresses&#34;&gt;Exhibit #3 - flapping addresses&lt;/h2&gt;

&lt;p&gt;Another one of the reported issues was the intermittent connectivity to some of the ingresses. If we keep watching the ingress in namespace &lt;code&gt;one&lt;/code&gt;, we should see something interesting:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubens one
kubectl get ingress --watch
NAME   CLASS    HOSTS   ADDRESS      PORTS   AGE
test   &amp;lt;none&amp;gt;   *       100.64.0.0   80      141m
test   &amp;lt;none&amp;gt;   *       172.18.0.2   80      141m
test   &amp;lt;none&amp;gt;   *       100.64.0.0   80      142m
test   &amp;lt;none&amp;gt;   *       172.18.0.2   80      142m
test   &amp;lt;none&amp;gt;   *       100.64.0.0   80      143m
test   &amp;lt;none&amp;gt;   *       172.18.0.2   80      143m
test   &amp;lt;none&amp;gt;   *       100.64.0.0   80      144m
test   &amp;lt;none&amp;gt;   *       172.18.0.2   80      144m
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It looks like the ingress address is flapping between our &amp;ldquo;good&amp;rdquo; MetalLB IP and the same exact IP that the other two ingresses have. Now let&amp;rsquo;s zoom out a bit and have a look at all three ingresses at the same time:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl get ingress --watch -A
NAMESPACE   NAME   CLASS    HOSTS   ADDRESS      PORTS   AGE
one         test   &amp;lt;none&amp;gt;   *       172.18.0.2   80      150m
three       test   &amp;lt;none&amp;gt;   *       172.18.0.2   80      150m
two         test   &amp;lt;none&amp;gt;   *       172.18.0.2   80      150m
one         test   &amp;lt;none&amp;gt;   *       100.64.0.0   80      150m
three       test   &amp;lt;none&amp;gt;   *       100.64.0.2   80      151m
three       test   &amp;lt;none&amp;gt;   *       172.18.0.2   80      151m
one         test   &amp;lt;none&amp;gt;   *       172.18.0.2   80      151m
one         test   &amp;lt;none&amp;gt;   *       100.64.0.0   80      151m
three       test   &amp;lt;none&amp;gt;   *       100.64.0.2   80      152m
one         test   &amp;lt;none&amp;gt;   *       172.18.0.2   80      152m
three       test   &amp;lt;none&amp;gt;   *       172.18.0.2   80      152m
one         test   &amp;lt;none&amp;gt;   *       100.64.0.0   80      152m

&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This looks even more puzzling - it seems that all ingresses have addresses that flap continuously. This would definitely explain the intermittent connectivity; however, the most important question now is &amp;ldquo;why&amp;rdquo;.&lt;/p&gt;

&lt;h2 id=&#34;exhibit-5-controller-logs&#34;&gt;Exhibit #5 - controller logs&lt;/h2&gt;

&lt;p&gt;The most obvious suspect at this stage is the ingress controller, since it&amp;rsquo;s the one that updates the status of its managed ingress resources. Let&amp;rsquo;s stay in the same namespace and look at its logs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl logs deploy/ingress-ingress-nginx-controller -f

event.go:278] Event(v1.ObjectReference{Kind:&amp;quot;Ingress&amp;quot;, Namespace:&amp;quot;one&amp;quot;, Name:&amp;quot;test&amp;quot;, UID:&amp;quot;7d1e4069-d285-4cf8-ba28-437d0a8fd04d&amp;quot;, APIVersion:&amp;quot;networking.k8s.io/v1beta1&amp;quot;, ResourceVersion:&amp;quot;55860&amp;quot;, FieldPath:&amp;quot;&amp;quot;}): type: &#39;Normal&#39; reason: &#39;UPDATE&#39; Ingress one/test

status.go:275] updating Ingress one/test status from [{172.18.0.2 }] to [{100.64.0.0 }]

event.go:278] Event(v1.ObjectReference{Kind:&amp;quot;Ingress&amp;quot;, Namespace:&amp;quot;one&amp;quot;, Name:&amp;quot;test&amp;quot;, UID:&amp;quot;7d1e4069-d285-4cf8-ba28-437d0a8fd04d&amp;quot;, APIVersion:&amp;quot;networking.k8s.io/v1beta1&amp;quot;, ResourceVersion:&amp;quot;55870&amp;quot;, FieldPath:&amp;quot;&amp;quot;}): type: &#39;Normal&#39; reason: &#39;UPDATE&#39; Ingress one/test
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This doesn&amp;rsquo;t make a lot of sense - the ingress controller clearly updates the status with the right IP, so why does it get overwritten, and by whom?&lt;/p&gt;

&lt;h2 id=&#34;exhibit-5-cluster-wide-logs&#34;&gt;Exhibit #6 - cluster-wide logs&lt;/h2&gt;

&lt;p&gt;At this point, we can allow ourselves a little bit of cheating. Since it&amp;rsquo;s a test cluster and we&amp;rsquo;ve only got a few ingresses configured, we can tail logs from all ingress controllers and watch all ingresses at the same time. Don&amp;rsquo;t forget to install &lt;a href=&#34;https://github.com/wercker/stern&#34; target=&#34;_blank&#34;&gt;stern&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl get ingress -A -w &amp;amp;
stern --all-namespaces -l app.kubernetes.io/name=ingress-nginx &amp;amp;
three ingress-ingress-nginx-controller-58b79c576b-94v8d controller status.go:275] updating Ingress three/test status from [{172.18.0.2 }] to [{100.64.0.2 }]

three       test   &amp;lt;none&amp;gt;   *       100.64.0.2   80      174m

two ingress-ingress-nginx-controller-5db5984d7d-vljth controller event.go:278] Event(v1.ObjectReference{Kind:&amp;quot;Ingress&amp;quot;, Namespace:&amp;quot;three&amp;quot;, Name:&amp;quot;test&amp;quot;, UID:&amp;quot;176f0f8e-d3d5-4476-9b51-2d02c7eb47e2&amp;quot;, APIVersion:&amp;quot;networking.k8s.io/v1beta1&amp;quot;, ResourceVersion:&amp;quot;57195&amp;quot;, FieldPath:&amp;quot;&amp;quot;}): type: &#39;Normal&#39; reason: &#39;UPDATE&#39; Ingress three/test
three ingress-ingress-nginx-controller-58b79c576b-94v8d controller event.go:278] Event(v1.ObjectReference{Kind:&amp;quot;Ingress&amp;quot;, Namespace:&amp;quot;three&amp;quot;, Name:&amp;quot;test&amp;quot;, UID:&amp;quot;176f0f8e-d3d5-4476-9b51-2d02c7eb47e2&amp;quot;, APIVersion:&amp;quot;networking.k8s.io/v1beta1&amp;quot;, ResourceVersion:&amp;quot;57195&amp;quot;, FieldPath:&amp;quot;&amp;quot;}): type: &#39;Normal&#39; reason: &#39;UPDATE&#39; Ingress three/test

two ingress-ingress-nginx-controller-5db5984d7d-vljth controller status.go:275] updating Ingress one/test status from [{100.64.0.0 }] to [{172.18.0.2 }]
two ingress-ingress-nginx-controller-5db5984d7d-vljth controller status.go:275] updating Ingress three/test status from [{100.64.0.2 }] to [{172.18.0.2 }]

three       test   &amp;lt;none&amp;gt;   *       172.18.0.2   80      174m
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;whodunit&#34;&gt;Whodunit&lt;/h2&gt;

&lt;p&gt;So, it looks like the culprit is the ingress controller in namespace &lt;code&gt;two&lt;/code&gt;: it tries to change the status fields of all three ingresses. Now it&amp;rsquo;s safe to look at exactly how it was installed; here is its helm values file:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;controller:
  publishService:
    enabled: false
    pathOverride: &amp;quot;two/svc&amp;quot;
  scope:
    enabled: false
  admissionWebhooks:
    enabled: false
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It looks like the scope variable is set incorrectly, so the ingress controller defaults to managing ingresses across all namespaces. This should be an easy fix - just change &lt;code&gt;scope.enabled&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt; and upgrade the chart.&lt;/p&gt;
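
&lt;p&gt;Assuming everything else in the values file stays the same, the relevant part of the fix would look like this (followed by a &lt;code&gt;helm upgrade&lt;/code&gt; with the updated values):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;controller:
  scope:
    enabled: true
&lt;/code&gt;&lt;/pre&gt;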

&lt;p&gt;However, this still doesn&amp;rsquo;t explain the private IP address or its origin. Let&amp;rsquo;s try the following command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl get nodes -o wide
NAME                           STATUS   ROLES    AGE    VERSION   INTERNAL-IP
ingress-puzzle-control-plane   Ready    master   5h3m   v1.18.2   172.18.0.2 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So this is where it comes from - it&amp;rsquo;s the IP of the k8s node we&amp;rsquo;ve been running our tests on. But why would it get allocated to an ingress? To understand that, we need to have a look at the nginx-ingress controller source code, specifically this function from &lt;a href=&#34;https://github.com/kubernetes/ingress-nginx/blob/master/internal/ingress/status/status.go#L174&#34; target=&#34;_blank&#34;&gt;status.go&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-go&#34;&gt;func (s *statusSync) runningAddresses() ([]string, error) {
	if s.PublishStatusAddress != &amp;quot;&amp;quot; {
		return []string{s.PublishStatusAddress}, nil
	}

	if s.PublishService != &amp;quot;&amp;quot; {
		return statusAddressFromService(s.PublishService, s.Client)
	}

	// get information about all the pods running the ingress controller
	pods, err := s.Client.CoreV1().Pods(s.pod.Namespace).List(context.TODO(), metav1.ListOptions{
		LabelSelector: labels.SelectorFromSet(s.pod.Labels).String(),
	})
	if err != nil {
		return nil, err
	}

	addrs := make([]string, 0)
	for _, pod := range pods.Items {
		// only Running pods are valid
		if pod.Status.Phase != apiv1.PodRunning {
			continue
		}

		name := k8s.GetNodeIPOrName(s.Client, pod.Spec.NodeName, s.UseNodeInternalIP)
		if !sliceutils.StringInSlice(name, addrs) {
			addrs = append(addrs, name)
		}
	}

	return addrs, nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is how the nginx-ingress controller determines the address to report in the ingress status:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check if the address is statically set with the &lt;code&gt;--publish-status-address&lt;/code&gt; flag.&lt;/li&gt;
&lt;li&gt;Try to collect addresses from a published service (load-balancer).&lt;/li&gt;
&lt;li&gt;If both of the above have failed, get the list of pods and return the IPs of the nodes they are running on.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This last bit is why we had that private IP in the status field. If you look at the above values YAML again, you&amp;rsquo;ll see that the &lt;code&gt;publishService&lt;/code&gt; value is overridden with a static service called &lt;code&gt;svc&lt;/code&gt;. However, because this service doesn&amp;rsquo;t exist and was never created, the ingress controller will fail to collect the right IP and will fall through to step 3.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/ingress-puzzle-solved.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;The logic described above is quite common and is also implemented by the &lt;a href=&#34;https://github.com/Kong/kubernetes-ingress-controller/blob/master/internal/ingress/status/status.go&#34; target=&#34;_blank&#34;&gt;Kong&lt;/a&gt; ingress controller. The idea is that if your k8s nodes have public IPs, the ingress should still be accessible, even without a load-balancer.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Getting Started with Cluster API using Docker</title>
      <link>https://networkop.co.uk/post/2020-05-cluster-api-intro/</link>
      <pubDate>Sun, 03 May 2020 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2020-05-cluster-api-intro/</guid>
      <description>

&lt;p&gt;Cluster API (CAPI) is a relatively new project aimed at deploying Kubernetes clusters using a declarative API (think YAML). The official documentation (a.k.a. the Cluster API book) does a very good job explaining the main &lt;a href=&#34;https://cluster-api.sigs.k8s.io/user/concepts.html&#34; target=&#34;_blank&#34;&gt;concepts&lt;/a&gt; and &lt;a href=&#34;https://cluster-api.sigs.k8s.io/introduction.html&#34; target=&#34;_blank&#34;&gt;goals&lt;/a&gt; of the project. I always find that one of the best ways to explore new technology is to see how it works locally, on my laptop, and Cluster API has a special &amp;ldquo;Docker&amp;rdquo; infrastructure provider (CAPD) specifically for that. However, the official documentation on how to set up a docker-managed cluster is sparse and fragmented. In this post, I&amp;rsquo;ll try to demonstrate the complete journey to deploy a single CAPI-managed k8s cluster and provide some explanation of what happens behind the scenes so that it&amp;rsquo;s easier to troubleshoot when things go wrong.&lt;/p&gt;

&lt;h2 id=&#34;prerequisites&#34;&gt;Prerequisites&lt;/h2&gt;

&lt;p&gt;Two things must be pre-installed before we can start building our test clusters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://kind.sigs.k8s.io/docs/user/quick-start/&#34; target=&#34;_blank&#34;&gt;kind&lt;/a&gt;&lt;/strong&gt; - a tool to setup k8s clusters in docker containers, it will be used as a management (a.k.a. bootstrap) cluster.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://cluster-api.sigs.k8s.io/user/quick-start.html#install-clusterctl&#34; target=&#34;_blank&#34;&gt;clusterctl&lt;/a&gt;&lt;/strong&gt; - a command line tool to interact with the management cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We&amp;rsquo;re gonna need to run a few scripts from the Cluster API Github repo, so let&amp;rsquo;s get a copy of it locally:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;git clone --depth=1 git@github.com:kubernetes-sigs/cluster-api.git &amp;amp;&amp;amp; cd cluster-api
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When building a management cluster with kind, it&amp;rsquo;s a good idea to mount the &lt;code&gt;docker.sock&lt;/code&gt; file from your host OS into the kind cluster, as mentioned in &lt;a href=&#34;https://cluster-api.sigs.k8s.io/clusterctl/developers.html#additional-steps-in-order-to-use-the-docker-provider&#34; target=&#34;_blank&#34;&gt;the book&lt;/a&gt;. This will allow you to see the CAPD-managed nodes directly in your host OS as regular docker containers.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cat &amp;gt; kind-cluster-with-extramounts.yaml &amp;lt;&amp;lt;EOF
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /var/run/docker.sock
        containerPath: /var/run/docker.sock
EOF
kind create cluster --config ./kind-cluster-with-extramounts.yaml --name clusterapi
kubectl cluster-info --context kind-clusterapi
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At this stage you should have your kubectl pointed at the new kind cluster, which can be verified like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl get nodes -o wide
NAME                       STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION          CONTAINER-RUNTIME
clusterapi-control-plane   Ready    master   66s   v1.17.0   172.17.0.2    &amp;lt;none&amp;gt;        Ubuntu 19.10   5.6.6-200.fc31.x86_64   containerd://1.3.2
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;preparing-a-capd-controller&#34;&gt;Preparing a CAPD controller&lt;/h2&gt;

&lt;p&gt;The docker image for the CAPD controller is not available in the public registry, so we need to build it locally. The following two commands will build the image and update the installation manifests to use that image.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;make -C test/infrastructure/docker docker-build REGISTRY=gcr.io/k8s-staging-capi-docker
make -C test/infrastructure/docker generate-manifests REGISTRY=gcr.io/k8s-staging-capi-docker
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next, we need to side-load this image into the kind cluster to make it available to the future CAPD deployment:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kind load docker-image --name clusterapi gcr.io/k8s-staging-capi-docker/capd-manager-amd64:dev
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;setting-up-a-docker-provider&#34;&gt;Setting up a Docker provider&lt;/h2&gt;

&lt;p&gt;Once again, following &lt;a href=&#34;https://cluster-api.sigs.k8s.io/clusterctl/developers.html#additional-steps-in-order-to-use-the-docker-provider&#34; target=&#34;_blank&#34;&gt;the book&lt;/a&gt;, we need to run a local override script to generate a set of manifests for Docker provider:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cat &amp;gt; clusterctl-settings.json &amp;lt;&amp;lt;EOF
{
  &amp;quot;providers&amp;quot;: [&amp;quot;cluster-api&amp;quot;,&amp;quot;bootstrap-kubeadm&amp;quot;,&amp;quot;control-plane-kubeadm&amp;quot;, &amp;quot;infrastructure-docker&amp;quot;]
}
EOF
cmd/clusterctl/hack/local-overrides.py
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You should be able to see the generated manifests at &lt;code&gt;~/.cluster-api/overrides/infrastructure-docker/latest/infrastructure-components.yaml&lt;/code&gt;; the last thing we need to do is let clusterctl know where to find them:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cat &amp;gt; ~/.cluster-api/clusterctl.yaml &amp;lt;&amp;lt;EOF
providers:
  - name: docker
    url: $HOME/.cluster-api/overrides/infrastructure-docker/latest/infrastructure-components.yaml
    type: InfrastructureProvider
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally, we can use the &lt;code&gt;clusterctl init&lt;/code&gt; command printed by the &lt;code&gt;local-overrides.py&lt;/code&gt; script to create all CAPI and CAPD components inside our kind cluster:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;clusterctl init --core cluster-api:v0.3.0 --bootstrap kubeadm:v0.3.0 --control-plane kubeadm:v0.3.0 --infrastructure docker:v0.3.0
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At this stage, we should see the following deployments created and ready (1/1).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;k get deploy -A | grep cap
capd-system                         capd-controller-manager                         1/1
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager       1/1
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager   1/1
capi-system                         capi-controller-manager                         1/1
capi-webhook-system                 capi-controller-manager                         1/1 
capi-webhook-system                 capi-kubeadm-bootstrap-controller-manager       1/1
capi-webhook-system                 capi-kubeadm-control-plane-controller-manager   1/1
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If &lt;code&gt;capd-system&lt;/code&gt; deployment is not READY and stuck trying to pull the image, make sure that the &lt;code&gt;capd-controller-manager&lt;/code&gt; deployment is using the image that was generated in the previous section.&lt;/p&gt;

&lt;h2 id=&#34;generating-a-capd-managed-cluster-manifest&#34;&gt;Generating a CAPD-managed cluster manifest&lt;/h2&gt;

&lt;p&gt;All the instructions provided so far can also be found in the official documentation. However, at this stage, the book starts having big gaps that are not trivial to figure out. TL;DR: you can just run the command below to generate a sample CAPD cluster manifest and move on to the next section. However, if you ever need to modify this command, check out my notes below it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;DOCKER_POD_CIDRS=&amp;quot;192.168.0.0/16&amp;quot; \
DOCKER_SERVICE_CIDRS=&amp;quot;10.128.0.0/12&amp;quot; \
DOCKER_SERVICE_DOMAIN=&amp;quot;cluster.local&amp;quot; \
clusterctl config cluster capd --kubernetes-version v1.17.5 \
--from ./test/e2e/data/infrastructure-docker/cluster-template.yaml \
--target-namespace default \
--control-plane-machine-count=1 \
--worker-machine-count=1 \
&amp;gt; capd.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At the time of writing, CAPD used &lt;code&gt;kindest/node&lt;/code&gt; docker images (see &lt;code&gt;defaultImageName&lt;/code&gt; in test/infrastructure/docker/docker/machines.go) and combined them with the tag provided in the &lt;code&gt;--kubernetes-version&lt;/code&gt; argument. Be sure to always check that there&amp;rsquo;s a matching tag on &lt;a href=&#34;https://hub.docker.com/r/kindest/node/tags&#34; target=&#34;_blank&#34;&gt;dockerhub&lt;/a&gt;. If it is missing (e.g. v1.17.3), the Machine controller will fail to create a docker container for your kubernetes cluster and you&amp;rsquo;ll only see the load-balancer container being created.&lt;/p&gt;

&lt;p&gt;Another issue is that clusterctl may not find the &lt;code&gt;cluster-template.yaml&lt;/code&gt; where it expects it, so the template has to be provided with the &lt;code&gt;--from&lt;/code&gt; argument. This template requires additional variables (all that start with &lt;code&gt;DOCKER_&lt;/code&gt;) to be set for it to be rendered. These can be modified as long as you understand what they do.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: never set the POD CIDR equal to the Service CIDR unless you want to spend your time troubleshooting networking and DNS issues.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, you should also make sure that the target namespace is specified explicitly, otherwise the generated manifest will contain an incorrect combination of namespaces and will get rejected by the validating webhook.&lt;/p&gt;

&lt;h2 id=&#34;creating-a-capd-managed-cluster&#34;&gt;Creating a CAPD-managed cluster&lt;/h2&gt;

&lt;p&gt;The final step is to apply the generated manifest and let the k8s controllers do their job.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl apply -f capd.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It&amp;rsquo;s worth spending a bit of time understanding what some of these controllers do. The &lt;code&gt;DockerCluster&lt;/code&gt; controller is responsible for the creation of a load-balancing container (capd-lb). A load-balancer is needed to provide a single API endpoint in front of multiple control-plane nodes. It&amp;rsquo;s built on top of the HAProxy image (kindest/haproxy:2.1.1-alpine) and does the healthchecking and load-balancing across all cluster control-plane nodes. It&amp;rsquo;s worth noting that the &lt;code&gt;DockerCluster&lt;/code&gt; resource is marked as &lt;code&gt;READY&lt;/code&gt; as soon as the controller can read the IP assigned to the &lt;code&gt;capd-lb&lt;/code&gt; container, which doesn&amp;rsquo;t necessarily mean that the cluster itself is built.&lt;/p&gt;

&lt;p&gt;Typically, all nodes in a CAPI-managed cluster are bootstrapped with a cloud-init config generated by the bootstrap controller. However, Docker doesn&amp;rsquo;t have a cloud-init equivalent, so the &lt;code&gt;DockerMachine&lt;/code&gt; controller simply executes each line of the bootstrap script using &lt;code&gt;docker exec&lt;/code&gt; commands. It&amp;rsquo;s also worth noting that the containers themselves are managed using the docker CLI rather than the API.&lt;/p&gt;
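
&lt;p&gt;The line-by-line replay can be pictured with a short Go sketch. This is purely illustrative - the runner below prints the &lt;code&gt;docker exec&lt;/code&gt; invocation instead of executing it, and none of this is the controller&amp;rsquo;s real code:&lt;/p&gt;

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// runBootstrap feeds every non-empty, non-comment line of a bootstrap
// script to the supplied runner, one line at a time - the same pattern
// the DockerMachine controller follows in the absence of cloud-init.
// Illustrative sketch only, not the controller's actual code.
func runBootstrap(script string, run func(line string) error) error {
	scanner := bufio.NewScanner(strings.NewReader(script))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if err := run(line); err != nil {
			return fmt.Errorf("%q failed: %w", line, err)
		}
	}
	return scanner.Err()
}

func main() {
	script := "# bootstrap\nkubeadm init\nmkdir -p /root/.kube\n"
	// Print the docker invocation instead of executing it.
	_ = runBootstrap(script, func(line string) error {
		fmt.Println("docker exec capd-control-plane sh -c", line)
		return nil
	})
}
```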

&lt;h2 id=&#34;installing-cni-and-metallb&#34;&gt;Installing CNI and MetalLB&lt;/h2&gt;

&lt;p&gt;As a bonus, I&amp;rsquo;ll show how to install CNI and MetalLB to build a completely functional k8s cluster. First, we need to extract the kubeconfig file and save it locally:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl get secret/capd-kubeconfig -o jsonpath={.data.value} \
  | base64 --decode  &amp;gt; ./capd.kubeconfig
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we can apply the CNI config, as suggested in the book.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;KUBECONFIG=./capd.kubeconfig kubectl \
  apply -f https://docs.projectcalico.org/v3.12/manifests/calico.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A minute later, both nodes should transition to &lt;code&gt;Ready&lt;/code&gt; state:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;KUBECONFIG=./capd.kubeconfig kubectl get nodes
NAME                              STATUS   ROLES    AGE   VERSION
capd-capd-control-plane-hn724     Ready    master   30m   v1.17.5
capd-capd-md-0-84df67c74b-lzm6z   Ready    &amp;lt;none&amp;gt;   29m   v1.17.5
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In order to be able to create load-balancer type services, we can install MetalLB in L2 mode. Thanks to the &lt;code&gt;docker.sock&lt;/code&gt; mounting we&amp;rsquo;ve done above, our test cluster is now attached to the same docker bridge as the rest of the containers in the host OS. We can easily determine which subnet it uses:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;DOCKER_BRIDGE_SUBNET=$(docker network inspect bridge | jq -r &#39;.[0].IPAM.Config[0].Subnet&#39;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next, using the &lt;a href=&#34;http://jodies.de/ipcalc-archive/ipcalc-0.41/ipcalc&#34; target=&#34;_blank&#34;&gt;ipcalc&lt;/a&gt; tool, we can pick a small range from the high end of that subnet:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;DOCKER_HIGHEND_RANGE=$(ipcalc -s 6 ${DOCKER_BRIDGE_SUBNET}  | grep 29 | tail -n 1)
&lt;/code&gt;&lt;/pre&gt;
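
&lt;p&gt;If ipcalc is not available, the same &amp;ldquo;highest /29&amp;rdquo; selection can be reproduced in a few lines of Go. This is an illustrative sketch; the &lt;code&gt;lastSubnet&lt;/code&gt; helper is made up for this post and not part of any of the tools above:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"net"
)

// lastSubnet returns the highest /29 network inside the given IPv4
// CIDR - a small range from the high end of the subnet, suitable
// for a MetalLB address pool.
func lastSubnet(cidr string) (string, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return "", err
	}
	ip := ipnet.IP.To4()
	if ip == nil {
		return "", fmt.Errorf("not an IPv4 subnet: %s", cidr)
	}
	// Compute the broadcast address: network OR inverted mask.
	bcast := make(net.IP, 4)
	for i := 0; i < 4; i++ {
		bcast[i] = ip[i] | ^ipnet.Mask[i]
	}
	// Clear the three host bits of a /29 to get its network address.
	bcast[3] &= 0xF8
	return fmt.Sprintf("%s/29", bcast), nil
}

func main() {
	r, _ := lastSubnet("172.17.0.0/16") // default docker bridge subnet
	fmt.Println(r)                      // 172.17.255.248/29
}
```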

&lt;p&gt;Now we can create the configuration for MetalLB:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cat &amp;gt; metallb_cm.yaml &amp;lt;&amp;lt;EOF
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: my-ip-space
      protocol: layer2
      addresses:
      - $DOCKER_HIGHEND_RANGE   
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally, all we have to do is install it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;KUBECONFIG=./capd.kubeconfig kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.9.3/manifests/namespace.yaml
KUBECONFIG=./capd.kubeconfig kubectl apply -f metallb_cm.yaml
KUBECONFIG=./capd.kubeconfig kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.9.3/manifests/metallb.yaml
KUBECONFIG=./capd.kubeconfig kubectl create secret generic -n metallb-system memberlist --from-literal=secretkey=&amp;quot;$(openssl rand -base64 128)&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To test it, we can deploy a test application and expose it with a service of type LoadBalancer:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;KUBECONFIG=./capd.kubeconfig kubectl create deployment test --image=nginx
KUBECONFIG=./capd.kubeconfig kubectl expose deployment test --name=lb --port=80 --target-port=80 --type=LoadBalancer
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we should be able to access the application running inside the cluster by hitting the external load-balancer IP:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;MetalLB_IP=$(KUBECONFIG=./capd.kubeconfig kubectl get svc lb -o jsonpath=&#39;{.status.loadBalancer.ingress[0].ip}&#39;)
curl -s $MetalLB_IP | grep &amp;quot;Thank you&amp;quot;
&amp;lt;p&amp;gt;&amp;lt;em&amp;gt;Thank you for using nginx.&amp;lt;/em&amp;gt;&amp;lt;/p&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/capd.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Network Simulations with Network Service Mesh</title>
      <link>https://networkop.co.uk/post/2020-01-nsm-topo/</link>
      <pubDate>Fri, 24 Jan 2020 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2020-01-nsm-topo/</guid>
      <description>

&lt;p&gt;In September 2019 I had the honour to &lt;a href=&#34;https://onseu19.sched.com/event/SYsb/large-scale-network-simulations-in-kubernetes-michael-kashin-arista-networks&#34; target=&#34;_blank&#34;&gt;present&lt;/a&gt; at Open Networking Summit in Antwerp. My talk was about &lt;a href=&#34;https://github.com/networkop/meshnet-cni&#34; target=&#34;_blank&#34;&gt;meshnet&lt;/a&gt; CNI plugin, &lt;a href=&#34;https://github.com/networkop/k8s-topo&#34; target=&#34;_blank&#34;&gt;k8s-topo&lt;/a&gt; orchestrator and how to use them for large-scale network simulations in Kubernetes. During the same conference, I attended a talk about Network Service Mesh and its new &lt;a href=&#34;https://onseu19.sched.com/event/SYum/kernel-based-forwarding-plane-for-network-service-mesh-radoslav-dimitrov-vmware&#34; target=&#34;_blank&#34;&gt;kernel-based forwarding dataplane&lt;/a&gt; which had a lot of similarities with the work that I&amp;rsquo;ve done for meshnet. Having had a chat with the presenters, we&amp;rsquo;ve decided that it would be interesting to try and implement a meshnet-like functionality with NSM. In this post, I&amp;rsquo;ll try to document some of the findings and results of my research.&lt;/p&gt;

&lt;h1 id=&#34;network-service-mesh-introduction&#34;&gt;Network Service Mesh Introduction&lt;/h1&gt;

&lt;p&gt;&lt;a href=&#34;https://networkservicemesh.io/&#34; target=&#34;_blank&#34;&gt;NSM&lt;/a&gt; is a CNCF project aimed at providing service mesh-like capabilities for L2/L3 traffic. In the context of Kubernetes, NSM&amp;rsquo;s role is to interconnect pods and set up the underlying forwarding, which involves creating new interfaces, allocating IPs and configuring the pod&amp;rsquo;s routing table. The main use cases are cloud-native network functions (e.g. 5G), service function chaining and any containerised applications that may need to talk over non-standard protocols. Similar to traditional service meshes, the intended functionality is achieved by injecting sidecar containers that communicate with a distributed control plane of network service managers, deployed as a &lt;a href=&#34;https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/&#34; target=&#34;_blank&#34;&gt;daemonset&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ll try to avoid repeating NSM&amp;rsquo;s theory here and instead refer my readers to the official &lt;a href=&#34;https://networkservicemesh.io/docs/concepts/what-is-nsm&#34; target=&#34;_blank&#34;&gt;documentation&lt;/a&gt; and a very good introductory &lt;a href=&#34;https://docs.google.com/presentation/d/1IC2kLnQGDz1hbeO0rD7Y82O_4NwzgIoGgm0oOXyaQ9Y/edit#slide=id.p&#34; target=&#34;_blank&#34;&gt;slide deck&lt;/a&gt;. There are a few concepts, however, that are critical to the understanding of this blogpost, that I&amp;rsquo;ll mention here briefly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Network Services&lt;/strong&gt; are built around a client-server model - a client receives a service from an endpoint (server).&lt;/li&gt;
&lt;li&gt;Both client and endpoint are implemented as &lt;strong&gt;containers&lt;/strong&gt; and interact with &lt;strong&gt;local control plane agents&lt;/strong&gt; over a gRPC-based API.&lt;/li&gt;
&lt;li&gt;Typically, a &lt;strong&gt;client&lt;/strong&gt; would request a service with &lt;code&gt;ns.networkservicemesh.io&lt;/code&gt; &lt;strong&gt;annotation&lt;/strong&gt;, which gets matched by a mutating webhook responsible for injecting an init container.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Endpoints&lt;/strong&gt;, being designed specifically to provide network services, have endpoint container statically defined as a &lt;strong&gt;sidecar&lt;/strong&gt; (unless they natively implement NSM&amp;rsquo;s SDK).&lt;/li&gt;
&lt;li&gt;One important distinction between client and endpoint sidecars is that the former is an &lt;strong&gt;init&lt;/strong&gt; container (runs to completion at pod create time) and the latter is a normal &lt;strong&gt;sidecar&lt;/strong&gt; which allows service reconfiguration at runtime.&lt;/li&gt;
&lt;li&gt;All client and endpoint configurations get passed as &lt;strong&gt;environment variables&lt;/strong&gt; to the respective containers either dynamically (client) or statically (endpoint).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Given all of the above, this is how you&amp;rsquo;d use NSM to create a point-to-point link between any two pods.&lt;/p&gt;

&lt;h1 id=&#34;using-nsm-to-create-links-between-pods&#34;&gt;Using NSM to create links between pods&lt;/h1&gt;

&lt;p&gt;First, we need to decide which side of the link will be a client and which will be an endpoint. This is where we&amp;rsquo;ll abuse NSM&amp;rsquo;s concepts for the first time, as it really doesn&amp;rsquo;t matter how this allocation takes place. For a normal network service, it&amp;rsquo;s fairly easy to identify and map client/server roles; for topology simulations, however, they can be assigned arbitrarily as both sides of the link are virtually equivalent.&lt;/p&gt;

&lt;p&gt;The next thing we need to do is statically add sidecar containers not only to the endpoint side of the link but to the client as well. This is another abuse of NSM&amp;rsquo;s intended mode of operation, where a client init container gets injected automatically by the webhook. The reason for that is that the init container will block until its network service request gets accepted, which may create a circular dependency if client/endpoint roles are assigned arbitrarily, as discussed above.&lt;/p&gt;

&lt;p&gt;The resulting &amp;ldquo;endpoint&amp;rdquo; side of the link will have the following pod manifest. The NSE sidecar container will read the environment variables and use NSM&amp;rsquo;s &lt;a href=&#34;https://github.com/networkservicemesh/networkservicemesh/tree/master/sdk&#34; target=&#34;_blank&#34;&gt;SDK&lt;/a&gt; to register itself with a &lt;code&gt;p2p&lt;/code&gt; network service with a &lt;code&gt;device=device-2&lt;/code&gt; label.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: v1
kind: Pod
metadata:
  name: device-2
spec:
  containers:
  - image: alpine:latest
    command: [&amp;quot;tail&amp;quot;, &amp;quot;-f&amp;quot;, &amp;quot;/dev/null&amp;quot;]
    name: alpine
  - name: nse-sidecar
    image: networkservicemesh/topology-sidecar-nse:master
    env:
    - name: ENDPOINT_NETWORK_SERVICE
      value: &amp;quot;p2p&amp;quot;
    - name: ENDPOINT_LABELS
      value: &amp;quot;device=device-2&amp;quot;
    - name: IP_ADDRESS
      value: &amp;quot;10.60.1.0/24&amp;quot;
    resources:
      limits:
        networkservicemesh.io/socket: &amp;quot;1&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When a local control plane agent receives the above registration request, it will create a new k8s &lt;code&gt;NetworkServiceEndpoint&lt;/code&gt; resource, effectively letting all the other agents know where to find this particular service endpoint (in this case it&amp;rsquo;s the k8s node called &lt;code&gt;nsm-control-plane&lt;/code&gt;). Note that the below resource is managed by NSM&amp;rsquo;s control plane and should not be created by the user:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: networkservicemesh.io/v1alpha1
kind: NetworkServiceEndpoint
metadata:
  generateName: p2p
  labels:
    device: device-2
    networkservicename: p2p
  name: p2ppdp2d
spec:
  networkservicename: p2p
  nsmname: nsm-control-plane
  payload: IP
status:
  state: RUNNING
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The next bit is the manifest of the network service itself. Its goal is to establish a relationship between multiple clients and endpoints of a service by matching their network service labels.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: networkservicemesh.io/v1alpha1
kind: NetworkService
metadata:
  name: p2p
spec:
  matches:
  - match: 
    sourceSelector:
      link: net-0
    route:
    - destination: 
      destinationSelector:
        device: device-2
  payload: IP
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The final bit is the &amp;ldquo;client&amp;rdquo; side of the link which will have the following pod manifest. Note that the format of &lt;code&gt;NS_NETWORKSERVICEMESH_IO&lt;/code&gt; variable is the same as the one used in &lt;a href=&#34;https://github.com/networkservicemesh/networkservicemesh/blob/master/docs/spec/admission.md#what-to-trigger-on&#34; target=&#34;_blank&#34;&gt;annotations&lt;/a&gt; and can be read as &amp;ldquo;client requesting a &lt;code&gt;p2p&lt;/code&gt; service with two labels (&lt;code&gt;link=net-0&lt;/code&gt; and &lt;code&gt;peerif=eth21&lt;/code&gt;) and wants to connect to it over a local interface called &lt;code&gt;eth12&lt;/code&gt;&amp;rdquo;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: v1
kind: Pod
metadata:
  name: device-1
spec:
  containers:
  - image: alpine:latest
    command: [&amp;quot;tail&amp;quot;, &amp;quot;-f&amp;quot;, &amp;quot;/dev/null&amp;quot;]
    name: alpine
  - name: nsc-sidecar
    image: networkservicemesh/topology-sidecar-nsc:master
    env:
    - name: NS_NETWORKSERVICEMESH_IO
      value: p2p/eth12?link=net-0&amp;amp;peerif=eth21
    resources:
      limits:
        networkservicemesh.io/socket: &amp;quot;1&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
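&lt;p&gt;The variable value follows a &lt;code&gt;service/interface?label=value&amp;amp;label=value&lt;/code&gt; pattern, so it can be parsed like a URL query string. Here is a minimal sketch of such a parser (the &lt;code&gt;parse_nsm_annotation&lt;/code&gt; helper is illustrative, not part of NSM):&lt;/p&gt;

```python
from urllib.parse import parse_qsl

def parse_nsm_annotation(value):
    """Split a 'service/interface?key=val&key=val' string into its parts."""
    service_part, _, label_part = value.partition("?")
    service, _, interface = service_part.partition("/")
    return service, interface, dict(parse_qsl(label_part))

print(parse_nsm_annotation("p2p/eth12?link=net-0&peerif=eth21"))
# → ('p2p', 'eth12', {'link': 'net-0', 'peerif': 'eth21'})
```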

&lt;p&gt;The client&amp;rsquo;s sidecar will read the above environment variable and send a connection request to the local control plane agent, which will perform the following sequence of steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Locate a network service called &lt;code&gt;p2p&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Find a match based on client-provided labels (&lt;code&gt;link=net-0&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Try to find a matching network service endpoint (&lt;code&gt;device=device-2&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Contact the remote agent hosting a matching endpoint (found in NSE CRDs) and relay the connection request.&lt;/li&gt;
&lt;li&gt;If the request gets accepted by the endpoint, instruct the local forwarding agent to set up pod&amp;rsquo;s networking.&lt;/li&gt;
&lt;/ol&gt;
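&lt;p&gt;Steps 2 and 3 above are essentially label-set matching; a rough stdlib-only approximation of that logic (the data layout is simplified from the actual manifests and the function name is hypothetical):&lt;/p&gt;

```python
def find_endpoint_labels(service, client_labels):
    """Return the destinationSelector of the first match whose
    sourceSelector is a subset of the client-provided labels."""
    for match in service["matches"]:
        source = match.get("sourceSelector", {})
        if all(client_labels.get(k) == v for k, v in source.items()):
            for route in match.get("routes", []):
                return route.get("destinationSelector", {})
    return None

# a simplified view of the 'p2p' NetworkService from this post
p2p = {"matches": [{"sourceSelector": {"link": "net-0"},
                    "routes": [{"destinationSelector": {"device": "device-2"}}]}]}

print(find_endpoint_labels(p2p, {"link": "net-0", "peerif": "eth21"}))
# → {'device': 'device-2'}
```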

&lt;h1 id=&#34;topology-orchestration-with-k8s-topo&#34;&gt;Topology orchestration with k8s-topo&lt;/h1&gt;

&lt;p&gt;Looking at the above manifests, it&amp;rsquo;s clear that writing them manually, even for smaller topologies, can be a serious burden. That&amp;rsquo;s why I&amp;rsquo;ve adapted the &lt;a href=&#34;https://github.com/networkop/k8s-topo&#34; target=&#34;_blank&#34;&gt;k8s-topo&lt;/a&gt; tool, which I originally wrote for &lt;a href=&#34;https://github.com/networkop/meshnet-cni&#34; target=&#34;_blank&#34;&gt;meshnet-cni&lt;/a&gt;, to produce and instantiate NSM-compliant manifests based on a single lightweight topology YAML file. The only thing needed to make it work with NSM is to add &lt;code&gt;nsm: true&lt;/code&gt; at the top of the file, e.g.:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;nsm: true
links:
  - endpoints: [&amp;quot;device-1:eth12&amp;quot;, &amp;quot;device-2:eth21&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Behind the scenes, k8s-topo will create the required network service manifest and configure all pods with correct sidecars and variables. As an added bonus, it will still attempt to inject startup configs and expose ports as described &lt;a href=&#34;https://github.com/networkop/k8s-topo&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
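&lt;p&gt;Conceptually, for every link in the topology file, k8s-topo has to turn one endpoint into an NSM client and the other into an endpoint of the &lt;code&gt;p2p&lt;/code&gt; service. A simplified sketch of that translation (the exact naming scheme is an internal detail of k8s-topo; the helper below is hypothetical):&lt;/p&gt;

```python
def link_to_manifest_bits(link_idx, endpoints):
    """One side of the link becomes the NSM client, the other the service
    endpoint; returns the client's env value and the endpoint's labels."""
    (c_dev, c_if), (e_dev, e_if) = (ep.split(":") for ep in endpoints)
    net = f"net-{link_idx}"
    client_env = f"p2p/{c_if}?link={net}&peerif={e_if}"
    endpoint_labels = {"device": e_dev, "networkservicename": "p2p"}
    return client_env, endpoint_labels

env, labels = link_to_manifest_bits(0, ["device-1:eth12", "device-2:eth21"])
print(env)     # → p2p/eth12?link=net-0&peerif=eth21
print(labels)  # → {'device': 'device-2', 'networkservicename': 'p2p'}
```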

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/k8s-nsm.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;h1 id=&#34;nsm-vs-meshnet-for-network-simulations&#34;&gt;NSM vs Meshnet for network simulations&lt;/h1&gt;

&lt;p&gt;In the context of virtual network simulations, both NSM and meshnet-cni can perform similar functions; however, their implementations and modes of operation are rather different. Here are the main distinctions of a CNI plugin approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All networking is set up BEFORE the pod is started.&lt;/li&gt;
&lt;li&gt;The CNI plugin does all the work, so there&amp;rsquo;s no need for sidecar containers.&lt;/li&gt;
&lt;li&gt;A very thin code base for a very specific use case.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And here are some of the distinctions of an NSM-based approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All networking is set up AFTER the pod is started.&lt;/li&gt;
&lt;li&gt;This does come with a requirement for a sidecar container, but potentially allows for runtime reconfiguration.&lt;/li&gt;
&lt;li&gt;No requirement for a CNI plugin at all.&lt;/li&gt;
&lt;li&gt;More generic use cases are possible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the end, neither option limits the currently available feature set of k8s-topo, so the choice can be made based on the characteristics of an individual environment. For example, on a managed k8s offering from GCP (GKE) or Azure (AKS), you&amp;rsquo;ll most likely be running &lt;a href=&#34;https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#kubenet&#34; target=&#34;_blank&#34;&gt;kubenet&lt;/a&gt; and won&amp;rsquo;t have the option to install a CNI plugin at all, in which case NSM may be the only available solution.&lt;/p&gt;

&lt;h1 id=&#34;demo&#34;&gt;Demo&lt;/h1&gt;

&lt;p&gt;Now it&amp;rsquo;s demo time and I&amp;rsquo;ll show how to use k8s-topo together with NSM to build a 10-node virtual router topology. We start by spinning up a local &lt;a href=&#34;https://github.com/kubernetes-sigs/kind&#34; target=&#34;_blank&#34;&gt;kind&lt;/a&gt; kubernetes cluster and installing NSM on it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;git clone https://github.com/networkservicemesh/networkservicemesh
cd networkservicemesh
make helm-init
SPIRE_ENABLED=false INSECURE=true FORWARDING_PLANE=kernel make helm-install-nsm 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next, we install the k8s-topo deployment and connect to the pod running it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl create -f https://raw.githubusercontent.com/networkop/k8s-topo/master/manifest.yml
kubectl exec -it deploy/k8s-topo -- sh
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For demonstration purposes I&amp;rsquo;ll use a random 10-node tree topology generated using a &lt;a href=&#34;https://en.wikipedia.org/wiki/Loop-erased_random_walk&#34; target=&#34;_blank&#34;&gt;loop-erased random walk&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;./examples/builder/builder 10 0
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The only thing needed to make it work with NSM is to set the &lt;code&gt;nsm&lt;/code&gt; flag to &lt;code&gt;true&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sed -i &#39;$ a\nsm: true&#39; ./examples/builder/random.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now everything&amp;rsquo;s ready for us to instantiate the topology inside k8s:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;k8s-topo --create ./examples/builder/random.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once all the pods are up, we can issue a ping from one of the routers to every other router in the topology and confirm the connectivity between their loopback IPs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;for i in `seq 0 9`; do (kubectl exec qrtr-192-0-2-0 -c router -- ping -c 1 192.0.2.$i|grep loss); done

1 packets transmitted, 1 packets received, 0% packet loss
1 packets transmitted, 1 packets received, 0% packet loss
1 packets transmitted, 1 packets received, 0% packet loss
1 packets transmitted, 1 packets received, 0% packet loss
1 packets transmitted, 1 packets received, 0% packet loss
1 packets transmitted, 1 packets received, 0% packet loss
1 packets transmitted, 1 packets received, 0% packet loss
1 packets transmitted, 1 packets received, 0% packet loss
1 packets transmitted, 1 packets received, 0% packet loss
1 packets transmitted, 1 packets received, 0% packet loss
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you want to have a look at your topology, it&amp;rsquo;s possible to make k8s-topo generate a D3 graph of all pods and their connections and view it in the browser:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;k8s-topo --graph ./examples/builder/random.yml
INFO:__main__:D3 graph created
INFO:__main__:URL: http://172.17.0.3:30000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/k8s-nsm-topo.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Network-as-a-Service Part 3 - Authentication and Admission control</title>
      <link>https://networkop.co.uk/post/2019-06-naas-p3/</link>
      <pubDate>Thu, 27 Jun 2019 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2019-06-naas-p3/</guid>
      <description>

&lt;p&gt;In the previous two posts, we&amp;rsquo;ve seen how to &lt;a href=&#34;https://networkop.co.uk/post/2019-06-naas-p2/&#34;&gt;build&lt;/a&gt; a custom network API with Kubernetes CRDs and &lt;a href=&#34;https://networkop.co.uk/post/2019-06-naas-p1/&#34;&gt;push&lt;/a&gt; the resulting configuration to network devices. In this post, we&amp;rsquo;ll apply the final touches by enabling oAuth2 authentication and enforcing separation between different tenants. All of these things are done while the API server processes incoming requests, so it would make sense to have a closer look at how it does that first.&lt;/p&gt;

&lt;h2 id=&#34;kubernetes-request-admission-pipeline&#34;&gt;Kubernetes request admission pipeline&lt;/h2&gt;

&lt;p&gt;Every incoming request has to go through several stages before it can get accepted and persisted by the API server. Some of these stages are mandatory (e.g. authentication), while some can be added through webhooks. The following diagram comes from another &lt;a href=&#34;https://kubernetes.io/blog/2019/03/21/a-guide-to-kubernetes-admission-controllers/&#34; target=&#34;_blank&#34;&gt;blogpost&lt;/a&gt; that covers each one of these stages in detail:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/admission-controller-phases.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Specifically for NaaS platform, this is how we&amp;rsquo;ll use the above stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All users will authenticate with Google and get mapped to an individual namespace/tenant based on their Google alias.&lt;/li&gt;
&lt;li&gt;A mutating webhook will inject default values into each request and allow users to define ranges as well as individual ports.&lt;/li&gt;
&lt;li&gt;Object schema validation will perform the syntactic validation of each request.&lt;/li&gt;
&lt;li&gt;A validating webhook will perform the semantic validation to make sure users cannot change ports assigned to a different tenant.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The following sections will cover these stages individually.&lt;/p&gt;
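&lt;p&gt;The stages above run strictly in order, and a request is only persisted if it survives all of them. The pipeline can be pictured as a toy function chain (a sketch for intuition, not the actual API server code):&lt;/p&gt;

```python
def admit(request, stages):
    """Run a request through pipeline stages in order; mutating stages may
    return a modified request, validating stages raise on rejection."""
    for stage in stages:
        request = stage(request) or request
    return request

def mutate(req):              # stage 2: inject default values
    req.setdefault("trunk", False)
    return req

def schema_validate(req):     # stage 3: syntactic checks
    assert "vlan" in req, "vlan is required"

def semantic_validate(req):   # stage 4: cross-tenant checks would go here
    pass

result = admit({"vlan": 10}, [mutate, schema_validate, semantic_validate])
print(result)  # → {'vlan': 10, 'trunk': False}
```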

&lt;h2 id=&#34;authenticating-with-google&#34;&gt;Authenticating with Google&lt;/h2&gt;

&lt;p&gt;Typically, external users are authenticated using X.509 certificates; however, the lack of CRL or OCSP support in Kubernetes creates a problem, since lost or exposed certs cannot be revoked. One of the alternatives is to use &lt;a href=&#34;https://kubernetes.io/docs/reference/access-authn-authz/authentication/#openid-connect-tokens&#34; target=&#34;_blank&#34;&gt;OpenID Connect&lt;/a&gt;, which works on top of the OAuth 2.0 protocol and is supported by several major identity providers like Google, Microsoft and Salesforce. Although OIDC has its own shortcomings (read &lt;a href=&#34;https://blog.gini.net/frictionless-kubernetes-openid-connect-integration-f1c356140937&#34; target=&#34;_blank&#34;&gt;this blogpost&lt;/a&gt; for details), it is still often preferred over X.509.&lt;/p&gt;

&lt;p&gt;In order to authenticate users with OIDC, we need to do three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure the API server to bind different user aliases to their respective tenants.&lt;/li&gt;
&lt;li&gt;Authenticate with the identity provider and get a signed token.&lt;/li&gt;
&lt;li&gt;Update local credentials to use this token.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first step is pretty straightforward and can be done with a simple RBAC &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-3/oidc/manifest.yaml&#34; target=&#34;_blank&#34;&gt;manifest&lt;/a&gt;. The latter two steps can either be done manually or automatically with the help of &lt;a href=&#34;https://github.com/gini/dexter&#34; target=&#34;_blank&#34;&gt;dexter&lt;/a&gt;. NaaS Github repo contains a sample two-liner &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-3/dexter-auth-public.sh&#34; target=&#34;_blank&#34;&gt;bash script&lt;/a&gt; that uses dexter to authenticate with Google and save the token in the local &lt;code&gt;~/.kube/config&lt;/code&gt; file.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All that&amp;rsquo;s required from a NaaS administrator is to maintain up-to-date tenant role bindings; users can then authenticate and maintain their tokens independently.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&#34;mutating-incoming-requests&#34;&gt;Mutating incoming requests&lt;/h2&gt;

&lt;p&gt;Mutating webhooks are commonly used to inject additional information (e.g. a sidecar proxy for service meshes) or default values (e.g. default CPU/memory) into incoming requests. Both mutating and validating webhooks get triggered based on a set of &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-3/webhooks/template-webhook.yaml&#34; target=&#34;_blank&#34;&gt;rules&lt;/a&gt; that match the API group and type of the incoming request. If there&amp;rsquo;s a match, the webhook gets called by the API server with an HTTP POST request containing the full body of the original request. The NaaS mutating &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-3/webhooks/mutate.py&#34; target=&#34;_blank&#34;&gt;webhook&lt;/a&gt; is written in Python/Flask, and the first thing it does is extract the payload and its type:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;request_info = request.json
modified_spec = copy.deepcopy(request_info)
workload_type = modified_spec[&amp;quot;request&amp;quot;][&amp;quot;kind&amp;quot;][&amp;quot;kind&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next, we set the default values and normalize ports:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;if workload_type == &amp;quot;Interface&amp;quot;:
    defaults = get_defaults()
    set_intf_defaults(modified_spec[&amp;quot;request&amp;quot;][&amp;quot;object&amp;quot;][&amp;quot;spec&amp;quot;], defaults)
    normalize_ports(modified_spec[&amp;quot;request&amp;quot;][&amp;quot;object&amp;quot;][&amp;quot;spec&amp;quot;])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The last function expands interface ranges, i.e. translates &lt;code&gt;1-5&lt;/code&gt; into &lt;code&gt;1,2,3,4,5&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;for port in ports:
    if not &amp;quot;-&amp;quot; in port:
        result.append(str(port))
    else:
        start, end = port.split(&amp;quot;-&amp;quot;)
        for num in range(int(start), int(end) + 1):
            result.append(str(num))  
&lt;/code&gt;&lt;/pre&gt;
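&lt;p&gt;Put together as a self-contained function, the range expansion could look like this (a sketch built around the snippet above):&lt;/p&gt;

```python
def normalize_ports(ports):
    """Expand entries like '1-5' into individual port strings."""
    result = []
    for port in ports:
        if "-" not in str(port):
            result.append(str(port))
        else:
            start, end = str(port).split("-")
            result.extend(str(num) for num in range(int(start), int(end) + 1))
    return result

print(normalize_ports(["1-5", "10"]))  # → ['1', '2', '3', '4', '5', '10']
```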

&lt;p&gt;Finally, we generate a json patch from the diff between the original and the mutated request, build a response and send it back to the API server.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;patch = jsonpatch.JsonPatch.from_diff(
    request_info[&amp;quot;request&amp;quot;][&amp;quot;object&amp;quot;], modified_spec[&amp;quot;request&amp;quot;][&amp;quot;object&amp;quot;]
)
admission_response = {
    &amp;quot;allowed&amp;quot;: True,
    &amp;quot;uid&amp;quot;: request_info[&amp;quot;request&amp;quot;][&amp;quot;uid&amp;quot;],
    &amp;quot;patch&amp;quot;: base64.b64encode(str(patch).encode()).decode(),
    &amp;quot;patchType&amp;quot;: &amp;quot;JSONPatch&amp;quot;,
}
return jsonify({&amp;quot;response&amp;quot;: admission_response})
&lt;/code&gt;&lt;/pre&gt;
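&lt;p&gt;The patch itself travels base64-encoded inside the JSON response; a small stdlib-only illustration of that encoding step and of how the API server would decode it on its side (the &lt;code&gt;jsonpatch&lt;/code&gt; library is omitted here):&lt;/p&gt;

```python
import base64
import json

# a JSON patch similar to what jsonpatch.JsonPatch.from_diff would produce
patch = [{"op": "add", "path": "/spec/trunk", "value": False}]
encoded = base64.b64encode(json.dumps(patch).encode()).decode()

# the API server base64-decodes the 'patch' field before applying it
decoded = json.loads(base64.b64decode(encoded))
print(decoded == patch)  # → True
```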

&lt;blockquote&gt;
&lt;p&gt;The &lt;a href=&#34;https://kubernetes.io/blog/2019/06/19/kubernetes-1-15-release-announcement/&#34; target=&#34;_blank&#34;&gt;latest&lt;/a&gt; (v1.15) release of Kubernetes has added support for default values to be defined inside the OpenAPI validation schema, making the job of writing mutating webhooks a lot easier.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&#34;validating-incoming-requests&#34;&gt;Validating incoming requests&lt;/h2&gt;

&lt;p&gt;As we&amp;rsquo;ve seen in the &lt;a href=&#34;https://networkop.co.uk/post/2019-06-naas-p2/&#34;&gt;previous post&lt;/a&gt;, it&amp;rsquo;s possible to use OpenAPI schema to perform syntactic validation of incoming requests, i.e. check the structure and the values of payload variables. This function is very similar to what you can &lt;a href=&#34;http://plajjan.github.io/validating-data-with-YANG/&#34; target=&#34;_blank&#34;&gt;accomplish&lt;/a&gt; with a YANG model and, in theory, OpenAPI schema can be converted to YANG and &lt;a href=&#34;http://ipengineer.net/2018/10/yang-openapi-swagger-code-generation/&#34; target=&#34;_blank&#34;&gt;vice versa&lt;/a&gt;. However useful, such validation only takes into account a single input and cannot cross-correlate this data with other sources. In our case, the main goal is to protect one tenant&amp;rsquo;s data from being overwritten by request coming from another tenant. In Kubernetes, semantic validation is commonly done using &lt;a href=&#34;https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook&#34; target=&#34;_blank&#34;&gt;validating&lt;/a&gt; admission webhooks and one of the most interesting tools in this landscape is &lt;a href=&#34;https://www.openpolicyagent.org/docs/v0.10.7/kubernetes-admission-control/&#34; target=&#34;_blank&#34;&gt;Open Policy Agent&lt;/a&gt; and its policy language called Rego.&lt;/p&gt;

&lt;h4 id=&#34;using-opa-s-policy-language&#34;&gt;Using OPA&amp;rsquo;s policy language&lt;/h4&gt;

&lt;p&gt;Rego is a special-purpose DSL with &amp;ldquo;rich support for traversing nested documents&amp;rdquo;. This means it can iterate over dictionaries and lists without using traditional for loops: when it encounters an iterable data structure, it automatically expands it to include all of its possible values. I&amp;rsquo;m not going to try to explain how &lt;a href=&#34;https://www.openpolicyagent.org/docs/v0.10.7/how-does-opa-work/&#34; target=&#34;_blank&#34;&gt;OPA works&lt;/a&gt; in this post; instead, I&amp;rsquo;ll show how to use it to solve our particular problem. Assuming that an incoming request is stored in the &lt;code&gt;input&lt;/code&gt; variable and &lt;code&gt;devices&lt;/code&gt; contains all custom device resources, this is how a Rego policy could look:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;input.request.kind.kind == &amp;quot;Interface&amp;quot;
new_tenant := input.request.namespace
port := input.request.object.spec.services[i].ports[_]
new_device := input.request.object.spec.services[i].devicename
existing_device_data := devices[_][lower(new_device)].spec
other_tenant := existing_device_data[port].annotations.namespace
not new_tenant == other_tenant
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-3/webhooks/validate.rego&#34; target=&#34;_blank&#34;&gt;actual policy&lt;/a&gt; contains more than 7 lines but the most important ones are listed above and perform the following sequence of actions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verify that the incoming request is of kind &lt;code&gt;Interface&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Extract its namespace and save it in the &lt;code&gt;new_tenant&lt;/code&gt; variable&lt;/li&gt;
&lt;li&gt;Save all ports in the &lt;code&gt;port&lt;/code&gt; variable&lt;/li&gt;
&lt;li&gt;Remember which device those ports belong to in the &lt;code&gt;new_device&lt;/code&gt; variable&lt;/li&gt;
&lt;li&gt;Extract existing port allocation information for each one of the above devices&lt;/li&gt;
&lt;li&gt;If any of the ports from the incoming request is found in the existing data, record its owner&amp;rsquo;s namespace&lt;/li&gt;
&lt;li&gt;Deny the request if the requesting port owner (tenant) is different from the current tenant.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Although Rego may not be that easy to write (or debug), it&amp;rsquo;s very easy to read compared to an equivalent implemented in, say, Python, which would have taken three times the number of lines and contained multiple for loops and conditionals. Like any DSL, it strives to strike a balance between readability and flexibility, while abstracting away less important things like web server request parsing and serialising.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The same functionality can be implemented in any standard web server (e.g. Python+Flask), so using OPA is not a requirement.&lt;/p&gt;
&lt;/blockquote&gt;
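&lt;p&gt;For comparison, a stripped-down version of the same ownership check in plain Python might look like this (the data layout is simplified from the actual CRDs, and the function name is illustrative):&lt;/p&gt;

```python
def conflicting_ports(request, devices):
    """Return ports that an incoming Interface request tries to claim
    even though they are annotated with a different tenant's namespace."""
    conflicts = []
    tenant = request["namespace"]
    for service in request["spec"]["services"]:
        existing = devices.get(service["devicename"].lower(), {})
        for port in service["ports"]:
            owner = existing.get(port, {}).get("namespace")
            if owner is not None and owner != tenant:
                conflicts.append((service["devicename"], port, owner))
    return conflicts

# port 11 on deviceA already belongs to tenant-a
devices = {"devicea": {"11": {"namespace": "tenant-a"}}}
request = {"namespace": "tenant-b",
           "spec": {"services": [{"devicename": "deviceA", "ports": ["11", "12"]}]}}

print(conflicting_ports(request, devices))  # → [('deviceA', '11', 'tenant-a')]
```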

&lt;h2 id=&#34;demo&#34;&gt;Demo&lt;/h2&gt;

&lt;p&gt;This is a complete end-to-end demo of Network-as-a-Service platform and encompasses all the demos from the previous posts. The code for this demo is available &lt;a href=&#34;https://github.com/networkop/network-as-a-service/archive/part-3.zip&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt; and can be run on any Linux OS with Docker.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/naas-p3.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;h4 id=&#34;0-prepare-for-oidc-authentication&#34;&gt;0. Prepare for OIDC authentication&lt;/h4&gt;

&lt;p&gt;For this demo, I&amp;rsquo;ll only use a single non-admin user. Before you run the rest of the steps, you need to make sure you&amp;rsquo;ve followed &lt;a href=&#34;https://github.com/gini/dexter&#34; target=&#34;_blank&#34;&gt;dexter&lt;/a&gt; to setup google credentials and update OAuth client and user IDs in &lt;code&gt;kind.yaml&lt;/code&gt;, &lt;code&gt;dexter-auth.sh&lt;/code&gt; and &lt;code&gt;oidc/manifest.yaml&lt;/code&gt; files.&lt;/p&gt;

&lt;h4 id=&#34;1-build-the-test-topology&#34;&gt;1. Build the test topology&lt;/h4&gt;

&lt;p&gt;This step assumes you have &lt;a href=&#34;https://github.com/networkop/docker-topo&#34; target=&#34;_blank&#34;&gt;docker-topo&lt;/a&gt; installed and c(vEOS) image &lt;a href=&#34;https://github.com/networkop/docker-topo/tree/master/topo-extra-files/veos&#34; target=&#34;_blank&#34;&gt;built&lt;/a&gt; and available in local docker registry.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;make topo
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This test topology can be any Arista EOS device reachable from the localhost. If using a different test topology, be sure to update the &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-3/topo/inventory.yaml&#34; target=&#34;_blank&#34;&gt;inventory&lt;/a&gt; file.&lt;/p&gt;

&lt;h4 id=&#34;2-build-the-kubernetes-cluster&#34;&gt;2. Build the Kubernetes cluster&lt;/h4&gt;

&lt;p&gt;The following step will build a docker-based &lt;a href=&#34;https://github.com/kubernetes-sigs/kind&#34; target=&#34;_blank&#34;&gt;kind&lt;/a&gt; cluster with a single control plane and a single worker node.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;make kubernetes
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;3-check-that-the-cluster-is-functional&#34;&gt;3. Check that the cluster is functional&lt;/h4&gt;

&lt;p&gt;The following step will build a base docker image and push it to dockerhub. It is assumed that the user has done &lt;code&gt;docker login&lt;/code&gt; and has their username saved in the &lt;code&gt;DOCKERHUB_USER&lt;/code&gt; environment variable.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;export KUBECONFIG=&amp;quot;$(kind get kubeconfig-path --name=&amp;quot;naas&amp;quot;)&amp;quot;
make warmup
kubectl get pod test
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is a 100MB image, so it may take a few minutes for the test pod to transition from &lt;code&gt;ContainerCreating&lt;/code&gt; to &lt;code&gt;Running&lt;/code&gt;.&lt;/p&gt;

&lt;h4 id=&#34;4-build-the-naas-platform&#34;&gt;4. Build the NaaS platform&lt;/h4&gt;

&lt;p&gt;The next command will install and configure both mutating and validating admission webhooks, the watcher and scheduler services and all of the required CRDs and configmaps.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;make build
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;5-authenticate-with-google&#34;&gt;5. Authenticate with Google&lt;/h4&gt;

&lt;p&gt;Assuming all files from step 0 have been updated correctly, the following command will open a web browser and prompt you to select a google account to authenticate with.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;make oidc-build
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;From now on, you should be able to switch to your google-authenticated user like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl config use-context mk
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And back to the admin user like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl config use-context kubernetes-admin@naas
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;6-test&#34;&gt;6. Test&lt;/h4&gt;

&lt;p&gt;To demonstrate how everything works, I&amp;rsquo;m going to issue three API requests. The &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-3/crds/03_cr.yaml&#34; target=&#34;_blank&#34;&gt;first&lt;/a&gt; API request will set up a large range of ports on test switches.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl config use-context mk
kubectl apply -f crds/03_cr.yaml                 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-3/crds/04_cr.yaml&#34; target=&#34;_blank&#34;&gt;second&lt;/a&gt; API request will try to re-assign some of these ports to a different tenant and will get denied by the validating controller.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl config use-context kubernetes-admin@naas
kubectl apply -f crds/04_cr.yaml        
Error from server (Port 11@deviceA is owned by a different tenant: tenant-a (request request-001), Port 12@deviceA is owned by a different tenant: tenant-a (request request-001),
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-3/crds/05_cr.yaml&#34; target=&#34;_blank&#34;&gt;third&lt;/a&gt; API request will update some of the ports from the original request within the same tenant.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl config use-context mk
kubectl apply -f crds/05_cr.yaml                 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The following result can be observed on one of the switches:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;devicea#sh run int eth2-3
interface Ethernet2
   description request-002
   shutdown
   switchport trunk allowed vlan 100
   switchport mode trunk
   spanning-tree portfast
interface Ethernet3
   description request-001
   shutdown
   switchport trunk allowed vlan 10
   switchport mode trunk
   spanning-tree portfast
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;outro&#34;&gt;Outro&lt;/h2&gt;

&lt;p&gt;Currently, the Network-as-a-Service platform is more of a proof-of-concept of how to expose parts of the device data model for end users to consume in a safe and controllable way. Most of it is built out of standard Kubernetes components, the total amount of Python code is under 1000 lines, and the code itself is pretty linear. I have plans to add more things like an SPA front-end and Git and OpenFaaS integration; however, I don&amp;rsquo;t want to invest too much time until I get some sense of external interest. So if this is something that you like and think you might want to try, ping me via social media and I&amp;rsquo;ll try to help get things off the ground.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Network-as-a-Service Part 2 - Designing a Network API</title>
      <link>https://networkop.co.uk/post/2019-06-naas-p2/</link>
      <pubDate>Thu, 20 Jun 2019 00:00:00 +0000</pubDate>
      
      <guid>https://networkop.co.uk/post/2019-06-naas-p2/</guid>
      <description>

&lt;p&gt;In the &lt;a href=&#34;https://networkop.co.uk/post/2019-06-naas-p1/&#34;&gt;previous post&lt;/a&gt;, we&amp;rsquo;ve examined the foundation of the Network-as-a-Service platform. A couple of services were used to build the configuration from data models and templates and push it to network devices using Nornir and Napalm. In this post, we&amp;rsquo;ll focus on the user-facing part of the platform. I&amp;rsquo;ll show how to expose a part of the device data model via a custom API built on top of Kubernetes and how to tie it together with the rest of the platform components.&lt;/p&gt;

&lt;h2 id=&#34;interacting-with-a-kubernetes-api&#34;&gt;Interacting with a Kubernetes API&lt;/h2&gt;

&lt;p&gt;There are two main ways to interact with a &lt;a href=&#34;https://kubernetes.io/docs/concepts/overview/kubernetes-api/&#34; target=&#34;_blank&#34;&gt;Kubernetes API&lt;/a&gt;: using a &lt;a href=&#34;https://kubernetes.io/docs/reference/using-api/client-libraries/&#34; target=&#34;_blank&#34;&gt;client library&lt;/a&gt;, which is how NaaS services communicate with K8s internally, or using the command-line tool &lt;code&gt;kubectl&lt;/code&gt;, which is intended to be used by humans. In either case, each API request is expected to contain at least the following fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;apiVersion&lt;/strong&gt; - all API resources are grouped and versioned to allow multiple versions of the same kind to co-exist at the same time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;kind&lt;/strong&gt; - defines the type of object to be created.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;metadata&lt;/strong&gt; - collection of request attributes like name, namespaces, labels etc.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;spec&lt;/strong&gt; - the actual payload of the request containing the attributes of the requested object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In order to describe these fields in a concise and human-readable way, API requests are often written in YAML, which is why you&amp;rsquo;ll see a lot of YAML snippets throughout this post. You can treat each one of those snippets as a separate API call that can be applied to a K8s cluster using a &lt;code&gt;kubectl apply&lt;/code&gt; command.&lt;/p&gt;
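&lt;p&gt;A request missing any of these four top-level fields would be rejected before any further processing; a toy check of that first hurdle (not the API server&amp;rsquo;s actual logic):&lt;/p&gt;

```python
def missing_required_fields(request):
    """Return the top-level fields an API request is expected to carry
    but does not (apiVersion, kind, metadata, spec)."""
    required = ("apiVersion", "kind", "metadata", "spec")
    return [field for field in required if field not in request]

req = {"apiVersion": "network.as.a.service/v1", "kind": "Interface",
       "metadata": {"name": "request-001", "namespace": "tenant-a"}}

print(missing_required_fields(req))  # → ['spec']
```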

&lt;h2 id=&#34;designing-a-network-interface-api&#34;&gt;Designing a Network Interface API&lt;/h2&gt;

&lt;p&gt;The structure and logic behind any user-facing API can be very customer-specific. Although the use case I&amp;rsquo;m focusing on here is a very simple one, my goal is to demonstrate the idea which, if necessary, can be adapted to other needs and requirements. So let&amp;rsquo;s assume we want to allow end users to change the access port configuration of multiple devices; this is what a sample API request might look like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: network.as.a.service/v1
kind: Interface
metadata:
  name: request-001
  namespace: tenant-a
spec:
  services:
    - devicename: deviceA
      ports: [&amp;quot;1&amp;quot;, &amp;quot;15&amp;quot;]
      vlan: 10
      trunk: yes
    - devicename: deviceB
      ports: [&amp;quot;1&amp;quot;,&amp;quot;10&amp;quot;, &amp;quot;11&amp;quot;]
      vlan: 110
      trunk: no
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are a few things to note in the above request:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every request will have a unique name per namespace (tenant).&lt;/li&gt;
&lt;li&gt;The main payload inside the &lt;code&gt;.spec&lt;/code&gt; property is a list of (VLAN) network services that need to be configured on network devices.&lt;/li&gt;
&lt;li&gt;Each element of the list contains the name of the device, a list of ports and a VLAN number to be associated with them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let&amp;rsquo;s see what it takes to make Kubernetes &amp;ldquo;understand&amp;rdquo; this API.&lt;/p&gt;

&lt;h2 id=&#34;introducing-kubernetes-crds&#34;&gt;Introducing Kubernetes CRDs&lt;/h2&gt;

&lt;p&gt;The API server is the main component of a Kubernetes cluster&amp;rsquo;s control plane. It receives all incoming requests, validates them, notifies the respective controllers and stores the objects in a database.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/k8s-api.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Apart from the APIs exposing a set of standard resources, Kubernetes provides the ability to define &lt;a href=&#34;https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/&#34; target=&#34;_blank&#34;&gt;custom resources&lt;/a&gt; - user-defined data structures that the API server can accept and store. Custom resources are the main building blocks for a lot of platforms built on top of K8s, and at the very least they allow users to store and retrieve arbitrary YAML data.&lt;/p&gt;

&lt;p&gt;In order to create a custom resource, we first need to define it with a custom resource definition (CRD) object that describes the name of the resource, the API group it belongs to and, optionally, the structure and values of the YAML data via an OpenAPI v3 &lt;a href=&#34;https://github.com/OAI/OpenAPI-Specification/blob/master/versions/3.0.0.md#schemaObject&#34; target=&#34;_blank&#34;&gt;schema&lt;/a&gt;. This is what a CRD for the above Interface API would look like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: interfaces.network.as.a.service
spec:
  group: network.as.a.service
  versions:
  - name: v1
    served: true
    storage: true
  scope: Namespaced
  subresources:
    status: {}
  names:
    plural: interfaces
    singular: interface
    kind: Interface
    shortNames:
    - intf
  validation:
    openAPIV3Schema:
      required: [&amp;quot;spec&amp;quot;]
      properties:
        spec:
          required: [&amp;quot;services&amp;quot;]
          properties:
            services:
              type: array
              items: 
                type: object
                required: [&amp;quot;devicename&amp;quot;, &amp;quot;vlan&amp;quot;, &amp;quot;ports&amp;quot;]
                properties:
                  devicename: 
                    type: string
                  vlan:
                    type: integer
                    minimum: 1
                    maximum: 4094
                  ports:
                    type: array
                    items:
                      type: string
                  trunk:
                    type: boolean
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As soon as we &lt;code&gt;kubectl apply&lt;/code&gt; the above YAML, our API server will expose the &lt;code&gt;Interface&lt;/code&gt; API for all external users to perform standard CRUD operations on, and store the results alongside other K8s resources in the etcd datastore.&lt;/p&gt;
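
&lt;p&gt;To make the schema constraints concrete, here is a rough client-side equivalent of the above checks in plain Python. This is a sketch only; in practice the API server itself rejects non-conforming requests against this schema:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Sketch: mirror of the OpenAPI v3 validation rules from the CRD above
def validate_spec(spec):
    if &amp;quot;services&amp;quot; not in spec:
        return False
    for svc in spec[&amp;quot;services&amp;quot;]:
        # required keys, as declared in the schema
        if any(key not in svc for key in (&amp;quot;devicename&amp;quot;, &amp;quot;vlan&amp;quot;, &amp;quot;ports&amp;quot;)):
            return False
        # VLAN IDs must stay within the 1-4094 range
        if not 1 &amp;lt;= svc[&amp;quot;vlan&amp;quot;] &amp;lt;= 4094:
            return False
    return True

ok = validate_spec({&amp;quot;services&amp;quot;: [{&amp;quot;devicename&amp;quot;: &amp;quot;deviceA&amp;quot;, &amp;quot;vlan&amp;quot;: 10, &amp;quot;ports&amp;quot;: [&amp;quot;1&amp;quot;]}]})
bad = validate_spec({&amp;quot;services&amp;quot;: [{&amp;quot;devicename&amp;quot;: &amp;quot;deviceA&amp;quot;, &amp;quot;vlan&amp;quot;: 5000, &amp;quot;ports&amp;quot;: [&amp;quot;1&amp;quot;]}]})
&lt;/code&gt;&lt;/pre&gt;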

&lt;h2 id=&#34;kubernetes-custom-controllers&#34;&gt;Kubernetes custom controllers&lt;/h2&gt;

&lt;p&gt;Custom resources, by themselves, do not provide any way to define the business logic of what to do with their data. This job is normally performed by Kubernetes controllers that &amp;ldquo;watch&amp;rdquo; events happening to these resources and act on them. This tandem between custom controllers and CRDs is so common that it led to the creation of an &lt;a href=&#34;https://coreos.com/operators/&#34; target=&#34;_blank&#34;&gt;operator pattern&lt;/a&gt; and a whole &lt;a href=&#34;https://twitter.com/alexellisuk/status/1132755044313522176&#34; target=&#34;_blank&#34;&gt;slew&lt;/a&gt; of operator frameworks with languages ranging from Go to Ansible.&lt;/p&gt;

&lt;p&gt;However, as I&amp;rsquo;ve mentioned in the &lt;a href=&#34;https://networkop.co.uk/post/2019-06-naas-p1/&#34;&gt;previous post&lt;/a&gt;, sometimes using a framework does not give you any benefit, so after having looked at some of the most popular ones, I decided to settle on my own implementation, which turned out to be a lot simpler. In essence, all that&amp;rsquo;s required from a custom controller is to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Subscribe to events about a custom resource (via K8s API).&lt;/li&gt;
&lt;li&gt;Once an event is received, perform the necessary business logic.&lt;/li&gt;
&lt;li&gt;Update the resource status if required.&lt;/li&gt;
&lt;/ol&gt;
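
&lt;p&gt;Stripped of Kubernetes specifics, those three steps boil down to a dispatch loop over a stream of watch events. The sketch below uses the event shape returned by the K8s watch API, but the handler wiring is purely illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Minimal controller loop (sketch): dispatch watch events to handlers
def run_controller(event_stream, handlers):
    processed = []
    for event in event_stream:
        action = event[&amp;quot;type&amp;quot;]      # ADDED / MODIFIED / DELETED
        resource = event[&amp;quot;object&amp;quot;]
        handler = handlers.get(action)
        if handler:
            handler(resource)            # the business logic goes here
        processed.append(action)
    return processed

seen = []
actions = run_controller(
    [{&amp;quot;type&amp;quot;: &amp;quot;ADDED&amp;quot;, &amp;quot;object&amp;quot;: {&amp;quot;metadata&amp;quot;: {&amp;quot;name&amp;quot;: &amp;quot;request-001&amp;quot;}}}],
    {&amp;quot;ADDED&amp;quot;: lambda obj: seen.append(obj[&amp;quot;metadata&amp;quot;][&amp;quot;name&amp;quot;])},
)
&lt;/code&gt;&lt;/pre&gt;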

&lt;p&gt;Let&amp;rsquo;s see how these custom controllers are implemented inside the NaaS platform.&lt;/p&gt;

&lt;h2 id=&#34;naas-controller-architecture&#34;&gt;NaaS controller architecture&lt;/h2&gt;

&lt;p&gt;NaaS platform has a special &lt;strong&gt;watcher&lt;/strong&gt; service that implements all custom controller logic. Its main purpose is to process incoming &lt;code&gt;Interface&lt;/code&gt; API events and generate a device-centric interface data model based on them.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://networkop.co.uk/img/naas-p2.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Internally, the watcher service is built out of two distinct controllers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;interface-watcher&lt;/strong&gt; - listens to &lt;code&gt;Interface&lt;/code&gt; API events and updates a custom &lt;code&gt;Device&lt;/code&gt; resource that stores an aggregated device-centric view of all interface API requests received so far. Once all the changes have been made, it updates the status of the request and notifies the scheduler about all the devices affected by this event.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;device-watcher&lt;/strong&gt; - listens to &lt;code&gt;Device&lt;/code&gt; API events and generates configmaps containing a device interface data model. These configmaps are then consumed by enforcers to build the access interface part of the total device configuration.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;interface-watcher-architecture&#34;&gt;Interface-watcher architecture&lt;/h2&gt;

&lt;p&gt;The main loop of the &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-2/watcher/interface-watcher.py&#34; target=&#34;_blank&#34;&gt;interface-watcher&lt;/a&gt; receives &lt;code&gt;Interface&lt;/code&gt; API events as they arrive and processes each network service individually:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;for network_service in event_object[&amp;quot;spec&amp;quot;][&amp;quot;services&amp;quot;]:
    results.append(
        process_service(event_metadata, network_service, action, defaults)
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For each service, depending on the type of the event, we either add, update or delete ports from the global device-centric model:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;device = get_or_create_device(device_name, defaults)
device_data = device[&amp;quot;spec&amp;quot;]
if action == &amp;quot;ADDED&amp;quot;:
    device_data = add_ports(
        network_service, device_data, resource_name, resource_namespace
    )
elif action == &amp;quot;DELETED&amp;quot;:
    device_data = delete_ports(network_service, device_data, resource_name)
elif action == &amp;quot;MODIFIED&amp;quot;:
    device_data = delete_all_ports(device_data, resource_name)
    device_data = add_ports(
        network_service, device_data, resource_name, resource_namespace
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For each of the added ports, we copy all settings from the original request and annotate it with metadata about its current owner and tenant:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;ports = origin.pop(&amp;quot;ports&amp;quot;)
for port in ports:
    # copy the settings per port, so that ports don&#39;t share one dict reference
    destination[port] = dict(origin)
    destination[port][&amp;quot;annotations&amp;quot;] = annotate(owner, namespace)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This results in the following custom &lt;code&gt;Device&lt;/code&gt; resource being created from the original &lt;code&gt;Interface&lt;/code&gt; API request:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: network.as.a.service/v1
kind: Device
metadata:
  name: devicea
  namespace: default
spec:
  &amp;quot;1&amp;quot;:
    annotations:
      namespace: tenant-a
      owner: request-001
      timestamp: &amp;quot;2019-06-19 22:09:02&amp;quot;
    trunk: true
    vlan: 10
  &amp;quot;15&amp;quot;:
    annotations:
      namespace: tenant-a
      owner: request-001
      timestamp: &amp;quot;2019-06-19 22:09:02&amp;quot;
    trunk: true
    vlan: 10
&lt;/code&gt;&lt;/pre&gt;
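
&lt;p&gt;The &lt;code&gt;annotate()&lt;/code&gt; helper itself is not shown in the snippets; a hypothetical version producing the metadata seen above could look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import datetime

# Hypothetical annotate() helper (assumption, not the actual NaaS code)
def annotate(owner, namespace):
    return {
        &amp;quot;owner&amp;quot;: owner,
        &amp;quot;namespace&amp;quot;: namespace,
        &amp;quot;timestamp&amp;quot;: datetime.datetime.now().strftime(&amp;quot;%Y-%m-%d %H:%M:%S&amp;quot;),
    }

meta = annotate(&amp;quot;request-001&amp;quot;, &amp;quot;tenant-a&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;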

&lt;p&gt;As subsequent requests can add or overwrite port ownership information, this metadata allows the controller to be selective about which ports to modify, so that it does not accidentally delete ports assigned to a different owner:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;new_destination = copy.deepcopy(destination)
for port in origin[&amp;quot;ports&amp;quot;]:
    if (port in destination) and (
        destination[port].get(&amp;quot;annotations&amp;quot;, {}).get(&amp;quot;owner&amp;quot;, &amp;quot;&amp;quot;) == owner
    ):
        log.debug(f&amp;quot;Removing port {port} from structured config&amp;quot;)
        new_destination.pop(port, None)
&lt;/code&gt;&lt;/pre&gt;
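
&lt;p&gt;To illustrate the effect, here is a self-contained, simplified version of that ownership check: only ports annotated with the requesting owner are removed, while ports belonging to other owners survive:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import copy

# Simplified, runnable version of the ownership-aware delete (sketch)
def delete_owned_ports(destination, ports, owner):
    result = copy.deepcopy(destination)
    for port in ports:
        if result.get(port, {}).get(&amp;quot;annotations&amp;quot;, {}).get(&amp;quot;owner&amp;quot;) == owner:
            result.pop(port, None)
    return result

device = {
    &amp;quot;1&amp;quot;: {&amp;quot;vlan&amp;quot;: 10, &amp;quot;annotations&amp;quot;: {&amp;quot;owner&amp;quot;: &amp;quot;request-001&amp;quot;}},
    &amp;quot;2&amp;quot;: {&amp;quot;vlan&amp;quot;: 20, &amp;quot;annotations&amp;quot;: {&amp;quot;owner&amp;quot;: &amp;quot;request-002&amp;quot;}},
}
remaining = delete_owned_ports(device, [&amp;quot;1&amp;quot;, &amp;quot;2&amp;quot;], &amp;quot;request-001&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;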

&lt;p&gt;Once the event has been processed, interface-watcher updates the device resource with the new values:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;device[&amp;quot;spec&amp;quot;] = device_data
update_device(device_name, device, defaults)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The last command triggers a MODIFIED event on the &lt;code&gt;Device&lt;/code&gt; CR and this is where the next controller kicks in.&lt;/p&gt;

&lt;h2 id=&#34;device-watcher-architecture&#34;&gt;Device-watcher architecture&lt;/h2&gt;

&lt;p&gt;The job of the &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-2/watcher/device-watcher.py&#34; target=&#34;_blank&#34;&gt;device-watcher&lt;/a&gt; is, first, to extract the payload from the above event:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;event_object = event[&amp;quot;object&amp;quot;]
event_metadata = event_object[&amp;quot;metadata&amp;quot;]
device_name = event_metadata[&amp;quot;name&amp;quot;]
device_data = event_object[&amp;quot;spec&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The payload is then serialised into a string and saved as a configmap, with additional pointers to a Jinja template and an order/priority number to help the enforcer build the full device configuration:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;k8s_api = client.CoreV1Api()
body = {
    &amp;quot;metadata&amp;quot;: {
        &amp;quot;name&amp;quot;: device_name,
        &amp;quot;annotations&amp;quot;: {&amp;quot;order&amp;quot;: &amp;quot;99&amp;quot;, &amp;quot;template&amp;quot;: &amp;quot;interface.j2&amp;quot;},
        &amp;quot;labels&amp;quot;: {&amp;quot;device&amp;quot;: device_name, &amp;quot;app&amp;quot;: &amp;quot;naas&amp;quot;, &amp;quot;type&amp;quot;: &amp;quot;model&amp;quot;},
    },
    &amp;quot;data&amp;quot;: {&amp;quot;structured-config&amp;quot;: yaml.safe_dump(device_data)},
}

k8s_api.replace_namespaced_config_map(
    device_name, event_metadata[&amp;quot;namespace&amp;quot;], body
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The remaining part of the workflow is similar to what was described in the previous post. The scheduler receives the request with the list of devices to be re-provisioned and spins up the required number of enforcers, which collect all relevant data models, combine them with Jinja templates and push the new configs.&lt;/p&gt;
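
&lt;p&gt;As a rough illustration of that last step, a structured-config model can be combined with a Jinja template along these lines. The template here is a simplified stand-in for the repo&#39;s &lt;code&gt;interface.j2&lt;/code&gt;, not the actual file:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;from jinja2 import Template

# Simplified stand-in for interface.j2 (assumption); renders trunk ports only
TEMPLATE = Template(&amp;quot;&amp;quot;&amp;quot;{% for port, cfg in model.items() %}interface Ethernet{{ port }}
   description {{ cfg.annotations.owner }}
   switchport trunk allowed vlan {{ cfg.vlan }}
   switchport mode trunk
{% endfor %}&amp;quot;&amp;quot;&amp;quot;)

model = {&amp;quot;1&amp;quot;: {&amp;quot;vlan&amp;quot;: 10, &amp;quot;trunk&amp;quot;: True, &amp;quot;annotations&amp;quot;: {&amp;quot;owner&amp;quot;: &amp;quot;request-001&amp;quot;}}}
config = TEMPLATE.render(model=model)
&lt;/code&gt;&lt;/pre&gt;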

&lt;h2 id=&#34;demo&#34;&gt;Demo&lt;/h2&gt;

&lt;p&gt;This demo picks up where the previous one left off. The assumption is that the test topology, K8s cluster and scheduler/enforcer services are already deployed as described in the &lt;a href=&#34;https://networkop.co.uk/post/2019-06-naas-p1/&#34;&gt;previous post&lt;/a&gt;. The code for this demo can be downloaded &lt;a href=&#34;https://github.com/networkop/network-as-a-service/archive/part-2.zip&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&#34;deploy-the-watcher-service&#34;&gt;Deploy the watcher service&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;make watcher-build
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The above command performs the following actions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creates &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-2/crds/00_namespace.yaml&#34; target=&#34;_blank&#34;&gt;two namespaces&lt;/a&gt; that will represent different platform tenants&lt;/li&gt;
&lt;li&gt;Creates &lt;code&gt;Interface&lt;/code&gt; and &lt;code&gt;Device&lt;/code&gt; &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-2/crds/01_crd.yaml&#34; target=&#34;_blank&#34;&gt;CRD objects&lt;/a&gt; describing our custom APIs&lt;/li&gt;
&lt;li&gt;Deploys both watcher &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-2/watcher/manifest.yaml&#34; target=&#34;_blank&#34;&gt;custom controllers&lt;/a&gt; along with the necessary RBAC rules&lt;/li&gt;
&lt;li&gt;Uploads the interface &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-2/templates/interface.j2&#34; target=&#34;_blank&#34;&gt;jinja template&lt;/a&gt; to be used by enforcers&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;test&#34;&gt;Test&lt;/h2&gt;

&lt;p&gt;Issue the &lt;a href=&#34;https://github.com/networkop/network-as-a-service/blob/part-2/crds/03_cr.yaml&#34; target=&#34;_blank&#34;&gt;first&lt;/a&gt; &lt;code&gt;Interface&lt;/code&gt; API call:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;kubectl apply -f crds/03_cr.yaml         
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Check the logs of the interface-watcher to make sure it&amp;rsquo;s picked up the &lt;code&gt;Interface&lt;/code&gt; ADDED event:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl logs deploy/interface-watcher
2019-06-20 08:20:01 INFO interface-watcher - interface_watcher: Watching Interface CRDs
2019-06-20 08:20:09 INFO interface-watcher - process_services: Received ADDED event request-001 of Interface kind
2019-06-20 08:20:09 INFO interface-watcher - process_service: Processing ADDED config for Vlans 10 on device devicea
2019-06-20 08:20:09 INFO interface-watcher - get_device: Reading the devicea device resource
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Check the logs of the device-watcher to make sure it has detected the &lt;code&gt;Device&lt;/code&gt; API event:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl logs deploy/device-watcher
2019-06-20 08:20:09 INFO device-watcher - update_configmaps: Updating ConfigMap for devicea
2019-06-20 08:20:09 INFO device-watcher - update_configmaps: Creating configmap for devicea
2019-06-20 08:20:09 INFO device-watcher - update_configmaps: Configmap devicea does not exist yet. Creating
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Check the logs of the scheduler service to see if it has been notified about the change:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl logs deploy/scheduler
2019-06-20 08:20:09 INFO scheduler - webhook: Got incoming request from 10.32.0.4
2019-06-20 08:20:09 INFO scheduler - webhook: Request JSON payload {&#39;devices&#39;: [&#39;devicea&#39;, &#39;deviceb&#39;]}
2019-06-20 08:20:09 INFO scheduler - create_job: Creating job job-6rlwg0
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Check the logs of the enforcer service to see if device configs have been generated and pushed:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl logs jobs/job-6rlwg0
2019-06-20 08:20:18 INFO enforcer - push_configs: Downloading Model configmaps
2019-06-20 08:20:18 INFO enforcer - get_configmaps: Retrieving the list of ConfigMaps matching labels {&#39;app&#39;: &#39;naas&#39;, &#39;type&#39;: &#39;model&#39;}
2019-06-20 08:20:18 INFO enforcer - push_configs: Found models: [&#39;devicea&#39;, &#39;deviceb&#39;, &#39;generic-cm&#39;]
2019-06-20 08:20:18 INFO enforcer - push_configs: Downloading Template configmaps
2019-06-20 08:20:18 INFO enforcer - get_configmaps: Retrieving the list of ConfigMaps matching labels {&#39;app&#39;: &#39;naas&#39;, &#39;type&#39;: &#39;template&#39;}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally, we can check the result on the device itself:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;devicea#sh run int eth1
interface Ethernet1
   description request-001
   switchport trunk allowed vlan 10
   switchport mode trunk
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;coming-up&#34;&gt;Coming up&lt;/h2&gt;

&lt;p&gt;What we&amp;rsquo;ve covered so far is enough for end users to be able to modify access port settings on multiple devices via a standard API. However, there&amp;rsquo;s still nothing protecting the configuration created by one user from being overwritten by a request coming from a user in a different tenant. In the next post, I&amp;rsquo;ll show how to validate requests to make sure they do not cross the tenant boundaries. Additionally, I&amp;rsquo;ll show how to mutate incoming requests to be able to accept interface ranges and inject default values. To top it off, we&amp;rsquo;ll integrate NaaS with Google&amp;rsquo;s identity provider via OIDC to allow users to be mapped to different namespaces based on their google alias.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
