Batfish

Stopping Network Outages Before They Start

2022-01-03T00:00:00+00:00

How do you detect buggy network configuration changes? My guess is that you use post-deployment checks and monitoring systems. And you should! But if that’s the only thing you’re doing, then you are unnecessarily risking network outages, breaches, and more. Those tools help you cure incidents after they occur, but they do nothing to prevent buggy changes from being deployed in the first place.

Of course, we all know that an ounce of prevention is worth a pound of cure, and we live by that motto in many walks of life. The medical domain is an obvious example, where preventative measures – physical exams, vaccinations, blood tests – are the norm and save lives every day. The software industry likewise puts enormous effort into preventing errors: each code change must pass a suite of behavioral tests and static checks before it is deployed to production. These tools prevent critical software bugs every day.

Like software, network automation pipelines can and should use behavioral tests and static checks to prevent configuration errors from reaching the production network. There are a variety of tools available to support these tasks. Network emulation (e.g., GNS3 and containerlab) and simulation (e.g., Batfish) enable behavioral testing. Configuration checkers (e.g., Batfish again) perform static checks.

Adding pre-deployment prevention into your workflow has huge benefits over relying solely on post-deployment cures:

Incident Prevention: Well, this first one is obvious but is also the most important – you can now prevent incidents from ever occurring, and hence completely avoid the damage that they would otherwise cause. Even with the best post-deployment checks, some incidents will cause significant damage. This damage may be due to a delay in detecting the problem, and even if the problem is detected quickly, it may still take time to mitigate successfully. A prominent recent example is Facebook’s global outage in October 2021 due to a buggy configuration change. While the problem was detected soon after the change was deployed to the network, it took six hours to bring the network back to a fully functional state.
Root-cause Diagnosis: Root-cause analysis is much easier to do for configuration bugs identified through pre-deployment checks. As an example, consider the lowly “fat finger” error, where a name is misspelled. This error can manifest in myriad ways in the live network, depending on the kind of name that is misspelled, the context in which it occurs, and the particular vendor. Batfish recently found a misspelled route-map name in a customer network that would have caused a forwarding loop. If the loop had been detected in the live network, diagnosing its root cause would have been very difficult. In contrast, a pre-deployment check that looks for undefined route-map names in the configuration can directly identify the bug independent of how it will affect the network’s behavior.
Incident Remediation: Once a problem is discovered and diagnosed, it has to be fixed. Again, the pre-deployment setting dramatically simplifies this task. When an issue arises in the live network, it must be mitigated as soon as possible to minimize (further) damage. Commonly this is done by rolling back to an old state. However, rollbacks can be expensive and they also must be done with great care for fear of introducing a new problem into the network. In the pre-deployment setting, the buggy configuration change hasn’t been deployed yet, so there is nothing to roll back. Further, once an appropriate update to the configuration change has been made, that version is itself validated with pre-deployment checks, ensuring that it truly fixes the problem and introduces no new problems.

In summary, pre-deployment validation is a must-have tool for any network automation pipeline. Most network incidents are caused by buggy configuration changes, and catching them before deployment is orders-of-magnitude cheaper than detecting and remediating them in the live network. Add a critical layer of prevention to your network – reach out and we’ll help you get started today!

Related resources:

Test drive network change MOPs without a lab: Learn how to test the impact of changes using Batfish Enterprise
The networking test pyramid: Learn how to develop a comprehensive test suite for network configurations
Network test automation: Rock, Paper, Scissors, Lizard, or Fish? Learn the differences between simulation and emulation for behavior testing

Incrementally automating your network

2021-11-30T00:00:00+00:00

Network automation can significantly benefit your organization. Gartner found that automating 70% of the network changes reduces outages by 50% and speeds service delivery by 50%. But achieving these results is elusive for most organizations—-they never get to the point where a substantial fraction of changes are successfully automated. A key hurdle is creating a reliable SoT (Source of Truth), a herculean task, especially for brownfield networks. This article outlines an incremental approach to network change automation that is not gated on a fully-fleshed SoT.

Let us begin by discussing the traditional “SoT-first” approach to network automation and its limitations. This approach proceeds as follows:

Build an SoT using something like Nautobot or NetBox.
Build jinja2 templates that can generate device configurations using the SoT.
Iterate on Steps 1 and 2 until generated configs are “close enough” to running configs.
Develop scripts to modify the SoT and templates to support common network changes.
Develop scripts to deploy configuration and run pre- and post-checks.

The promise here is that, once you have done these steps, you’d be able to automatically update the network for common change requests.

This SoT-first approach is incredibly difficult to execute, however. Any long-running network will have variations and snowflakes, and it can be difficult to capture the true values for various settings and develop the right configuration templates. If you have three vintages of leaf configurations with different settings in each, how do you determine the correct SoT settings? Which attributes are device level vs group level? Which templates apply to which devices? There is also the challenge that not all configuration elements are easy to represent in the SoT. Access control lists, firewall rules, and route maps are particularly challenging.

The difficulty of building the right SoT and templates means that it can take a long while to see the RoI (return on investment). Your organization may lose focus and give up along the way.

Is there a better approach to network automation—one that provides immediate returns and incremental value with incremental effort? Fortunately, the answer is yes. It is a change-focused approach that directly automates network changes, without requiring a detailed SoT that can generate full configs.

Suppose you want to automate firewall changes that open access to new services. The inputs to the change are IP addresses, protocols, and ports of the service. As part of executing this change, you need to validate that the requested access is not already there (pre-check), generate the rulebase update to permit access, and validate that the update has the intended impact (post-check). If you can generate and validate this change directly, you do not need to be able to generate the firewall configuration from scratch.

Thus, the change-focused approach to network automation works as follows:

Select the type of change you want to automate—favor common or risky ones.
Develop scripts to generate the configuration change
Develop scripts to deploy configuration and run pre- and post-checks.

This approach will provide RoI as soon as you have automated the first change. After automating the top ten changes, you would likely experience almost all of the benefits of network change automation.

To successfully execute the change-focused approach, your change MOPs (method of procedure) are a good place to start. As they are based on your network structure and business logic, they already have the information you need, including the configuration commands and pre- and post-checks. The right tools can help you effectively automate what the MOP is doing. For change generation, you may use jina templates rendered using request parameters or use Ansible resource modules.

For change validation, you need assurance about end-to-end network behavior. When network configs are not generated from scratch, you cannot just assume that devices are configured in a certain way; snowflakes and gaps in your assumptions can cause the change to not have its intended impact, or worse, cause collateral damage. Ideally, you would validate the impact of the change before deploying the change to production. Simulation tools like Batfish or emulation tools like GNS3 enable such validation. See our blog that discusses differences between these methods.

For examples of change-focused automation using Batfish Enterprise, see these blogs: Firewall rule changes, Provisioning a new subnet.

Change-focused automation is not incompatible with SoTs and can be augmented with whatever SoT you have. Even if you have partial SoT, with data that is not detailed enough to generate full device configs, you can use it to generate some types of changes and to generate automatic tests (e.g., no firewall change should open access to sensitive services defined in the SoT). However, unlike the SoT-first approach, change-focused automation is not gated on the availability of a fully-fleshed-out SoT. It can leverage even a minimal SoT and allows you to incrementally improve the SoT. It is this incremental nature that makes change-focused automation an attractive, practical approach to network automation.

The networking test pyramid

2021-06-16T00:00:00+00:00

An automated test suite is the key to continuous integration (CI), the DevOps practice of rapidly integrating changes into mainline. The test suite is run on every change to check that individual modules and the full system continue to behave as expected as developers add new features or modify existing ones. A high-quality test suite gives developers and reviewers the confidence that the changes are correct and do not cause collateral damage.

As networks become more like software (“Infrastructure-as-Code”), automated test suites—sometimes also called “policies” or “polices-as-code”—are essential to safely and rapidly evolving networks to meet new business needs. However, little guidance is available for network engineers toward creating a good test suite. What are all the types of tests that one should consider to cover all the bases? What is the purpose of each type and how do different types relate to each other?

To help create high-quality test suites, we present the networking test pyramid. This pyramid is adapted from the well-known software test pyramid and groups tests into increasing granularity levels from checking individual elements of the network to checking end-to-end behavior. While this concept applies to all types of network testing, our focus is on testing network configurations—and changes to them—because most outages are related to configuration changes. The tested configurations and changes may be generated manually or automatically (e.g., using templates and a source-of-truth).

In addition to a sound conceptual framework, creating a high-quality test suite also needs a practical testing framework. As anyone who has written automatic tests will testify, the choice of this framework is critical and directly influences what you can or cannot test. An ideal test framework enables you to easily express tests at multiple granularities and frees you from worrying about low level details such as syntax and semantics of various device vendors. We will show how Batfish, an advanced network configuration analysis framework, rises to this challenge.

The Software Test Pyramid

Before discussing the networking test pyramid, let us review the software test pyramid. Mike Cohn introduced the software test pyramid in his book “Succeeding with Agile” in 2009. Several versions of the pyramid have been proposed since. We use Martin Fowler’s version as our reference.

At the bottom, we have unit tests that check individual modules; in the middle, we have integration tests that check interactions between two or more modules; and at the top, we have end-to-end tests that check end-to-end behaviors such as responses to user requests. As we move from bottom to top, the focus changes from testing isolated aspects of the system to testing complex interactions.

Because there are fewer interactions at the unit test level, these tests tend to be faster to execute and their failures easier to debug. Unit tests can also check modules for a broader range of inputs that may be hard to reproduce as part of an integration or E2E test. However, these advantages of unit tests do not imply that higher-level tests can be ignored. They are closer to what users and applications experience and validate interactions that may have been left untested by unit tests because of a missing test case, the difficulty of testing, or incorrect assumptions about how other modules behave. There have been many an Internet meme that drive home this point.

Given the unique strengths of different pyramid levels, a good test suite contains tests at all levels. That helps you find and fix errors easily, achieve better test coverage, and directly validate user experience. By borrowing the test pyramid concept, we can develop networking test suites with similar advantages.

The Test Pyramid applied to Networking

Our adaptation of the test pyramid is shown below.

While it has a different number of levels and uses different terminology, it retains the essence of its software test pyramid. As we move from bottom to top, the focus changes from testing localized, isolated aspects to complex interactions. Listing the levels from bottom to top:

Configuration content: At this level, you test if device configurations contain what you expect. It includes checks such as whether interface IP addresses are correct, whether BGP peers are assigned to peer groups, whether access control list (ACL) names follow site standards, and whether certain lines appear in the config. You are not validating behavior at this level (which will happen later) but operating at the textual layer of configurations. You are primarily checking that human, coding, or source-of-truth error (depending on how you generate configs) has not led to improper configuration content.
Device behavior: At this level, you test the behavior of individual devices such as which packets its ACLs (access control list) permit/deny and how its route maps process BGP announcements. When you tested configuration content, you did not directly test behaviors which result from the interaction of many configuration lines. For instance, the behavior of an ACL depends on the order of ACL lines and on definitions of address groups that it uses; ACL behavior testing ensures that all of these aspects combine to produce expected behavior. As the debate about unit-vs-integration tests illustrates, it is important to test higher-level behaviors even when lower-level aspects are tested well.
Device adjacencies: At this level, you test if devices can establish the right adjacencies such as BGP sessions, GRE tunnels, and VPNs. These adjacencies form the foundation of network behavior, so testing them directly is important. It is possible to have a test suite where all configuration-content and device-behavior tests pass but protocol adjacency tests fail due to incompatibility between devices.
End-to-end: At this level, you test the end-to-end behavior of the network control and data planes, whether it propagates and filters routing information as expected and whether it forwards and drops traffic as expected. These tests are powerful because they directly test applications experience and exercise a broad range of underlying aspects. When these tests pass, applications are more likely to experience a functioning network and give you confidence that many of the underlying aspects are correct and interacting as expected.

If you wanted to map these levels to the software pyramid, one could think of a device as a module. So, the device behavior, device adjacencies, and E2E networking levels map respectively, to unit, integration, and E2E software levels. The software counterpart of configuration content testing would be analyses such as linting and checking compliance with formatting and variable naming rules. These non-behavioral checks are not in the software test pyramid, though software does undergo these tests (often as part of the build process).

Putting the Networking Test Pyramid to Practice

Now that we have learned the theory of the networking test pyramid, it is time to put it to practice. We do that by developing a test suite using pybatfish, the Python client of Batfish, for the data center network below. It is a multi-vendor network with an eBGP-based leaf-spine fabric (Arista), along with a firewall (Palo Alto), and border routers (Juniper MX) that connect to the outside world.

In this data center, there are many aspects you’d want to test to get confidence in its behavior and configuration changes. For brevity, we will focus on testing aspects related to connectivity between leafs and to the outside world. A comprehensive test suite will check many other aspects as well such as NTP servers, interface MTUs, connectivity to management devices, and so on. This GitHub repository has the code for all the tests below, implemented in the pytest framework.

Configuration content

We start with the lowest level of the pyramid, where we check configuration content. Given our focus on basic connectivity, we may write the following tests.

All interfaces IPs must have expected values. The expected values depend on the setup. Exact value for each interface could come from an IPAM or the requirement could be simply that all values should be drawn from certain prefixes and be unique. Our example tests use the second criteria.
All local ASNs must have expected values. Again, the expected values depend on the setup. Our example tests assume that each layer of the DC must use local ASNs within a range and their ASNs should be globally unique.
All configured BGP peers must have expected remote ASNs. This test is based on the ASN allocation above, so the remote ASNs of spine-facing peers on leafs must be within the range used for the spines.
All leaf routers must have EBGP multipath enabled, to load balance via ECMP.

Pybatfish implementations of all these tests is here. The eBGP multipath test is:

Line 96 uses Batfish’s bgpProcessConfiguration question to extract information about all BGP processes on leaf nodes. This information is returned as a Pandas DataFrame—Pandas is a popular data analysis framework and a DataFrame is a tabular representation of the data—in which rows correspond to BGP processes and columns to different settings of the process. Line 97 extracts BGP processes for which Multipath_EBGP is False. Finally, Line 98 checks that no such processes were found.

We see that tests take only a few lines of Python, and nowhere did we need to account for vendor-specific syntax or defaults. This simplicity stems from the structured, vendor-neutral data model that Batfish builds from device configs.

Device behavior

Having tested important configuration content, you can now test that various elements of the configuration combine to produce intended device-level behaviors. You may write the following tests.

The access control list (ACLs) at the border router must block incoming traffic from private address spaces and the set of malicious sources that we have identified.
The route map at the border router must filter routing announcements for private address spaces.
The route map at the border leafs must aggregate SVI prefixes before propagating them upwards in the data center.

Pybatfish implementation of these tests is here. The ACL test is:

Line 13 uses the searchFilters question of Batfish to find any traffic from the blocked IP space that is permitted by the ACL, and then Line 16 asserts that no such traffic is found. In contrast to the alternative of grep’ing for blocked prefixes in the device configuration, this test actually validates behavior. Grep-based validation is fragile because a line that permits some or all of the blocked traffic may also appear in the configuration, resulting in you falsely concluding that the traffic is blocked.

Device adjacencies

Now that we have validated the behavior of individual devices, we can validate different types of adjacencies between devices using these tests:

All pairs of interfaces that we expect to be connected (based on LLDP data or a source of truth like Netbox) must be in the same subnet.
All (and only) expected BGP adjacencies must be present.

Pybatfish implementation for both tests is here. The second test is:

This code is using the bgpEdges question of Batfish to extract all established BGP edges in the network. BGP adjacencies that do not get established due to, say, incompatible peer configuration settings, will not be returned. The question returns an edge per DataFrame row, which Line 19 transforms into a set of all node pairs. Line 21 asserts that this set is identical to what we expect based on the source of truth.

End-to-end

With the lower-level tests in place, you are ready to test important end-to-end aspects of the network’s control and data planes. You may write the following tests:

The default route must be present on all routers. This route comes into the data center from outside. We want to test that it is propagated everywhere. Traffic to the Internet will be dropped if the default route is accidentally filtered somewhere.
Each leaf router must have the route to all SVI prefixes. Because end hosts are in these prefixes, this test checks that routing within the data center is working well and all pairs of hosts have a path to each other.
All public services hosted in the data center must be reachable from the Internet. No valid connection request should be dropped by the network.
No private service must be reachable from the Internet. This security property must hold no matter how an adversary crafts their packet.
Key external services, such as Google DNS and AWS, must be accessible from all leaf routers.

Pybatfish implementations of all these tests is here. The public services test is:

For each public service, it is using the reachability question of Batfish to find valid flows that will fail. Valid flows are defined as those starting at the Internet (Line 11), have a source IP that is not among blocked prefixes (Line 13), have a source port among ports that are not blocked anywhere in the network (Line 14), and have destination IP and port corresponding to the service (Lines 15, 16). It is then asserting that no valid flow fails.

This Batfish test is exhaustive. It is considering billions of possibilities in the space of valid flows, and it will find and report if there is any flow that cannot go all the way from the Internet to the service. Such strong guarantees are not possible if you were to test reachability of public services via traceroute.

Wrap up

That wraps up our example test suite. Hopefully, you could see how the pyramid helps you think comprehensively about network tests and how Batfish helps you implement those tests. After you’ve defined a test suite for your network, you’d be able to run it for every planned change. Imagine how rapidly and confidently you’d then be able to change the network. That is the power of a good automated test suite!

Check out these related resources

Read “A practical approach to building a network CI/CD pipeline” to learn how to embed your test suite in an automation pipeline.

Closing the loop on testing network changes

2021-05-26T00:00:00+00:00

“..the best way to guard against error is to design systems with layered and overlapping defenses … like slices of Swiss cheese being layered on top of one another until there were no holes you could see through” - from The Premonition, Michael Lewis

This post was co-authored by Dinesh Dutt and Ratul Mahajan. It was also posted on Medium and The Elegant Network.

Network changes, such as adding a rack, adding a VLAN or a BGP peer or upgrading the OS, can easily cause an outage and materially impact your business. Rigorous testing is key to minimizing the chances of change-induced outages. A central tenet of such testing is test automation—a program should do the testing, not (error-prone) humans. Test automation should target all stages of the change process. Prior to deployment, it should test that the change is correct and that the network is ready for it. Post deployment, it should test that change was correctly deployed and had the intended impact. This “closed-loop test automation” makes the change process highly resilient and catches problems as early as possible.

But writing the code to automate network testing can be quite complicated. For instance, if you were adding a new leaf, prior to deploying your change, you may want to test that the IP addresses on the new leaf do not overlap with existing ones. So, you may write a script that mines addresses from configurations and then checks for uniqueness. Similarly, post deployment, you may want to test that all spines have the new prefix. So, you may write a script to fetch and process “show” data from all spines. Writing such scripts is fairly complex because you have to know the right commands, know how to extract the data (via textfsm or json queries), and so on. Sometimes, you have to combine information from multiple commands. And of course, this is different for every vendor, and in many cases, across different versions of the vendor. Your ability to automate network testing increases dramatically if you have tools that can take out much of the complexity. In this article, we cover two such open-source tools, Batfish and Suzieq, that help you easily automate closed-loop testing.

The three stages of closed-loop test automation

Closed-loop network testing has three stages:

Pre-approval testing: Before you schedule the planned change for a maintenance window, test that it is correct.
Deployment pre-testing: Before you deploy the change, test that the network is in a state where the change is safe to make. It is not uncommon that the current device state has drifted from what is assumed to have been configured.
Deployment post-testing: Right after you deploy the change to the network, test that the change produced the intended behavior.

The figure places these stages within a typical change workflow. Pre-approval testing is done when you are designing and reviewing the change, and deployment pre- and post-testing is done during the maintenance window.

Experienced network engineers will recognize that the three stages map to how manual change validation happens today. Peer-reviews fill in for pre-approval testing, and running show commands before and after the change fill in for deployment pre- and post-tests. We show how you can automate such validation. To be clear, we are not claiming that automatic testing can or should completely supplant human judgement. Rather, automation and humans can work together to make network changes more efficient and robust. This hybrid mode is akin to software, where developers use both a battery of automatic tests and peer reviews, and information produced by automatic tests greatly assists human reviewers.

Tools of the trade

Let us first introduce the tools of the test automation trade that we’ll use.

Batfish

Batfish analyzes network configuration to validate that the behavior of individual devices and the network as a whole matches the user’s intent. It constructs a model of the devices (Cisco router, Arista router, Palo Alto firewall etc.) and uses the device configuration files to put the model in the state specified by the configuration. It can then use formal verification analysis to comprehensively answer questions such as whether a BGP session is up, if an IP address can communicate with another, and so on. Because of its reliance on only device config files, it is a simple yet powerful tool that doesn’t require you to run slow and expensive emulation tools such as GNS3, EVE-NG or Vagrant to validate network behavior.

Suzieq

Suzieq is the first open source multi-vendor network observability platform that gathers operational data about the network to answer complex questions about the health of the network. It enables both deep data collection and analysis. It can fetch all kinds of data from the network, including routing protocol state, interface state, network overlay state and packet forwarding table state. You can then easily perform sophisticated analysis of the collected data via either a GUI or CLI or programmatically. Using either a REST API or the native python interface, you can also write tests that assert that the network is behaving as you intended (e.g., is this interface up? is this route present?).

You may be wondering: why do I need two tools? The short answer is that their testing capabilities are complementary. Batfish reasons about network behavior based on configuration (which may not have been deployed yet) and provides comprehensive guarantees for large sets of packets. Suzieq reasons about actual network behavior based on operational data. You use Batfish before pushing the change to the network for comprehensive analysis that the change is correct. This analysis assumes that the network is in a certain state at the time of deployment. Suzieq first helps validate those assumptions, and then also helps ensure that the change had the intended impact after it is deployed.

Testing focus aside, there are many similarities between the two tools that make them work well together. Both Batfish and Suzieq are multi-vendor and normalize information across vendors, so your tests are independent of the specific vendors in your network. Both have Python libraries that make it easy to build end-to-end workflows. And they both use the popular Pandas data analysis framework to present and analyze information. Pandas represents data using a DataFrame, a tabular data structure with rows and columns. You can find out the names of the columns in a DataFrame using the columns() method and use its powerful data analysis methods to inspect and analyze network behavior. A particularly useful method is “query” which filters rows/columns per user specifications. Some examples of using query are here.

Implementing closed loop testing

We now illustrate how Batfish and Suzieq combine to enable closed-loop network test automation. We will consider a trivial 2-leaf, 2-spine Clos topology, though the testing workflow and the example tests below apply equally to bigger, real-world networks. For brevity, however, we describe only a handful of tests in this post. A real test suite should have more tests that cover additional concerns. Developing a good test suite is an art form by itself, and we’ll cover this topic in a future post. The configuration fragments and tests used in this post are available in this GitHub repo.

Installing the Software

Download the Docker container for Batfish and launch the service as follows:

docker pull batfish/batfish docker run –name batfish -v batfish-data:/data -p 9997:9997 -p 9996:9996 batfish/batfish

Install pybatfish (the Python client for Batfish) and Suzieq in a Python virtual environment. You need at least Python version 3.7.1 to use these two packages.

pip install pybatfish pip install suzieq

A quick introduction to the Pythonic interfaces of Batfish and Suzieq is useful now. The Python APIs of Batfish are documented here. Batfish provides a set of questions that return information about your network such as the properties of nodes, interfaces, BGP sessions, and routing tables. For example, to get the status of all BGP sessions, you would use the bgpSessionStatus question as follows.

bfq.bgpSessionStatus().answer().frame()

The .answer().frame() part transforms the information returned by the question into a DataFrame that you can inspect and test using Pandas APIs.

Suzieq’s Python interface is defined here. Suzieq organizes information in tables. For example, you can get the BGP table via:

bgp_tbl = get_sqobject(‘bgp’)

Every table contains a set of functions that return a Pandas DataFrame. Two common functions are get() and aver() (because assert is a Python keyword, Suzieq uses aver, an old synonym). Because Suzieq analyzes the operational state of the network, you must first gather this state by running the Suzieq poller for the devices of interest. These instructions will help you start the poller on your network.

You are now ready to implement the first stage of closed-loop testing.

Pre-approval testing

The goal of pre-approval testing is to ensure that the change is correct. We show how a Python program using Batfish, the tool we’ll use in this stage of testing, helps you catch errors in your change. What exactly you test during pre-approval testing depends on the change. Let’s continue with the example that we’re adding a leaf to our network. We have a new config that we created using the existing config from one of the other leaves. Or maybe you have a template and you’re using Ansible to generate the config for this new device. But due to a cut-and-paste or a coding error, one of the interface IP addresses is that of another device, and not the one you’re deploying. A string of numbers is easy to miss even with a peer review. Batfish can easily catch such errors.

To start pre-approval testing with Batfish, put the configuration files of each router in a directory as described here. You can add leaf03’s config along with the modified spine configs to the configs subdirectory.

You can write Python programs that use the Batfish API to automate your pre-approval testing. Here is an example of such a program.

This program initializes the network snapshot (with planned config modifications) in init_bf() and defines two tests. test_bgp_status() uses the bgpSessionStatus question to validate that all BGP sessions will be established after the change. test_all_svi_prefixes_are_on_all_leafs() verifies that the SVI prefixes will be reachable on all nodes. It uses the interfaceProperties question to retrieve all SVI prefixes and verifies that each is reachable on all nodes. You retrieve the list of nodes using the nodeProperties question

TIP: The first time you use Batfish on your network, take a look at the output of bfq.initIssues().answer().frame() to confirm that Batfish understands it well. The output of this command is also a good thing to check when a test fails because problems such as syntax errors are also reported in it.

Hopefully, you now see the power of automated testing with tools like Batfish and Suzieq. A few lines of code can validate complex end-to-end behaviors across your entire network. When you add another leaf or spine, you can run this test suite as is. In fact, you can run the same test suite across different vendors. Our example network uses Arista EOS. You won’t have to change even a single line if it used Cisco or Juniper or Cumulus or a mix.

You can even use pytest, the Python testing framework, to run the tests and make full use of an advanced testing framework. If any of the assertions fail, pytest will report them, and you can investigate the error, fix the config change, and rerun the test suite.

Good testing tools also make it easy to debug test failures. How you do that depends on the test. For example, if we had assigned an incorrect interface IP address on the new leaf, test_bgp_status() would fail because not all sessions would be in ESTABLISHED state. You may then look at the output of bgpSessionStatus question, which for this example will show that the sessions on leaf03 and spine01 are incompatible. To understand why, you may then run the bgpSessionCompatibility question as follows.

This output tells you that you likely have the wrong IP address on leaf03 (NO_LOCAL_IP), and that spine01 expects to establish a session to 10.127.0.5 but no such IP is present in the snapshot (UNKNOWN_REMOTE). If you fix the configs, and rerun the tests, they should all pass now, and you can be confident that your change is ready to be scheduled for deployment.

Deployment pre-testing

Pre-approval testing happened against the network snapshot that existed then. When the time comes to deploy the change during the maintenance window, the network may be in a different state. Some links may have failed and the planned change could interact with the failures in unexpected ways. Or, the network’s configurations may have drifted in an incompatible way. We must thus test that the change is safe to deploy right before deploying it.

A combination of Batfish and Suzieq enables deployment pre-testing. Suzieq will fetch the latest network configs and state. You can feed those new configs to Batfish along with the planned config change and re-run the tests that you ran before. This re-run confirms that the change is still correct and is compatible with the current network configuration.

Suzieq helps you test that the network is in a state that is ready for the change you’re about to make. For example, if one of the spines is down, then attempts to configure it will fail. Alternatively, you must verify that the spine configuration change is using a port that is not already in use. It is important to double check our assumptions about the state of the network. Measure twice, cut once, as the adage goes. If there’s an unexpected surprise, you can abort proceeding with the change (no rollback needed).

As in the case of Batfish, your automated test suite will be a Python program. The following snippet shows how you can use Suzieq to test that the spines are alive, the port Ethernet3 being provisioned to connect to the new leaf is free, and that the SVI prefix being allocated is unused.

Each test uses get_sqobject() to get the relevant tables, then uses the get function to retrieve the rows and columns of interest, and finally checks that a specific column has an expected value on all nodes. The .all() checks that the field has that value on all rows of the retrieved dataset. Thus, the test to check that all spines are alive uses the “device” table to retrieve information about the spines, and checks that the “status” column has the value “alive” in all rows. test_spine_port_is_free() assumes that the spine ports have been cabled up and uses the lack of an LLDP peer to confirm that the port connecting to the new leaf is unused

Like Batfish, this code is vendor-agnostic and works for any Suzieq-supported vendor. If you add additional leafs, you just need to change the values of SPINE_IFNAME and NEW_SVI_PREFIX. This is the power of writing tests using frameworks like Suzieq.

If all deployment pre-tests pass, you can confidently deploy the change. But before you declare victory, you still need to test that the deployment went as planned. So, let’s do that next.

Deployment post-testing

Deployment post-testing aims to verify that the change was successful. A simple list of things to test our example change include: the spines are now peering correctly with the new leaf, the new SVI prefix is correctly assigned, and that the SVI prefixes of all the leafs are reachable via all the other leafs.

The Python program to test all this looks as follows.

Before running deployment post-tests, make sure that you have added the new leaf to the Suzieq inventory and restarted the poller to gather data from the new leaf as well.

Suzieq can also perform a battery of tests using the aver method of the table. For example, if you had accidentally deployed the config with the incorrect IP address on leaf03, you can use the interface aver which checks for consistency of addresses, MTUs, VLANs, etc. The output would look as follows:

Now you can examine the interface IP addresses of leaf03/Ethernet1 and spine01/Ethernet3 to determine which of the two has the incorrect IP address and fix it.

Summary

As in the software domain, rigorous testing is key to evolving the network rapidly and safely. Closed-loop testing leads to the least surprise with the most reliability. It is done in three stages—-pre-approval testing, deployment pre-testing, and deployment post-testing—-and catches errors at the appropriate point during the deployment. However, writing automated tests can be a complex process without the help of appropriate tools. Fortunately, open source tools like Batfish and Suzieq greatly simplify writing and maintaining automated tests. Give them a try and make your network changes robust and error free.

Automating the long pole of network changes

2021-05-18T00:00:00+00:00

When it comes to automating network changes, most network engineers want to start with automatic config generation and deployment. It just feels like that is the heart of the challenge, and it certainly feels like a fun thing to do.

But assume for a moment that you’ve automated config generation and deployment. Have you now significantly reduced the time it takes to service change requests?

If your network is like the many that we (or the good folks at NetworkToCode) work with, chances are that the answer is no. There are several critical and critical-path tasks that have not been automated, including:

Ensuring that the generated change is correct
Reviewing and approving the change
Testing the impact of the change post deployment (post-change testing)

Per Amdahl’s law, unless you automate these tasks too, your end-to-end gains from network automation will be limited. By automating config generation and deployment, you have likely shaved off only tens of minutes from the time it takes you to service a request, without making a big dent into the end-to-end time. This effect is illustrated in the figure below.

At the same time, without automating testing, the risk of a bad change making it to the network is still high. You cannot count on the change generation script always generating correct changes. It is all too easy for this script to overlook a legacy snowflake in the network or interact poorly with some recent changes (perhaps made manually). You thus need to still validate that the auto-generated change is correct and, for big changes, have a colleague review them for added assurance. This is an error prone task given the complexity of modern networks.

Change Reviews in Batfish Enterprise enable you to fully automate change testing, attack the long poles in your change workflows, and make network changes more reliable.

Let us illustrate how they work via an example: Allowing access to a new service from the outside. The service request may look like the following:

ticket_id	tkt1234
service_name	NewService
destination_prefixes	10.100.40.0/24, 10.200.50.0/24
ports	tcp/80, tcp/8080

Your change generation script will use the request parameters to generate the configuration commands for one or more devices. For example, it may generate the following change to the Palo Alto firewall at the edge of the network:

set service S_TCP_80 protocol tcp port 80 set service-group SG_NEWSERVICE members S_TCP_80 set service S_TCP_8080 protocol tcp port 8080 set service-group SG_NEWSERVICE members S_TCP_8080

set address tkt123-dst1 ip-netmask 10.100.40.0/24 set address-group tkt123-dst static tkt123-dst1 set address tkt123-dst2 ip-netmask 10.200.50.0/24 set address-group tkt123-dst static tkt123-dst2

set rulebase security rules tkt123 from OUTSIDE set rulebase security rules tkt123 to INSIDE set rulebase security rules tkt123 source any set rulebase security rules tkt123 destination tkt123-dst set rulebase security rules tkt123 application any set rulebase security rules tkt123 service SG_NEWSERVICE set rulebase security rules tkt123 action allow

This change may be generated using Jinja2 templates, an internal source-of-truth like Netbox, or the Palo Alto Ansible module. Regardless of how it is generated, you can submit it to Batfish Enterprise and analyze it using three criteria.

Criterion 1: The change should have the intended behavior

The first criterion is that the change should result in the intended behavior. In our example, that means the firewall should allow the traffic to the service and external sources should be able to reach the service. Batfish Enterprise provides a variety of test templates that enable you to validate the network behavior before and after the change (pre- and post-change tests).

For our example, you would use the Service Accessibility template to test that the service is not accessible from the Internet before the change and is accessible after the change. This test can be expressed using the following YAML:

You would also use the Cross-Zone Firewall Policy template to test firewall behavior for service traffic from outside zone to inside zones. It should not allow this traffic before the change and allow it after the change. Both end-to-end service accessibility and firewall-focused tests are useful because they can yield different results. The firewall-focused test may pass and the end-to-end test may still fail if the traffic is blocked elsewhere on the path.

You would generate these change-behavior tests as part of change generation. Thus, based on the service request parameters, you are generating the change and the tests for the change.

Criterion 2: The change should not violate network policy

In addition to the change meeting its behavioral specification, you need to also ensure that it does not violate any network policy. Network policies are behaviors that must always be true, independent of the change. For example, certain subnets must never be accessible from the Internet and access to the main corporate Website must never be blocked. It is possible to have a situation where change behavior tests pass but a key network policy is violated, for instance, if request parameters are wrong. Such changes must not reach the network.

With Batfish Enterprise, once you’ve configured the network policies, they are evaluated automatically for each change. Policies use the same basic templates as change behavior tests. A policy that certain services must never be accessible from the Internet will use the Service Protection template as follows:

Criterion 3: The change should not cause collateral damage

A final criterion to build confidence in the change is testing that it does not do collateral damage, e.g., accidentally allowing more traffic than intended. Batfish Enterprise can predict the full impact of the change, including how the routing tables will differ and how traffic will be permitted or blocked by it. The screenshot below shows how the network connectivity is impacted by our example change. We see that the Internet can now reach leaf40 and leaf50, where the two prefixes are hosted, and no other traffic will be allowed or denied as a result of the change.

These differences enable you to determine if the change has exactly the impact you want–no more, no less. Using Batfish Enterprise APIs you can assert that the change does not allow traffic to any leaf routers other than the ones corresponding to the service request or that the routing tables are not altered on any router.

Rapid iteration

When you evaluate a change per the criteria above, you may find that one or more tests fail. Batfish Enterprise helps you debug such failures quickly. For our example, we may find that while we successfully opened the firewall, end-to-end connectivity is still missing. Batfish Enterprise will then output (screenshot below — the same information is available programmatically) an example flow (within the traffic we intended to allow) that cannot reach NewService. It will also show that this traffic is being dropped at the border routers and that there is in fact a different flow that can reach the service. We can see that the difference between these two flows is the destination port: tcp/8080 versus tcp/80.

What is happening here is port 8080 is not open at the border routers. Thus, we have a situation where some traffic can reach NewService and some cannot. Without the comprehensive analysis provided by Batfish Enterprise, you may incorrectly infer that the change is correct.

Importantly, Batfish Enterprise is finding such problems with the planned change prior to the maintenance window. Without it, you may discover such problems above during the maintenance window. At that point, you’d have to roll back the change, debug it, and schedule it for another maintenance window. That would have significantly stretched the service request time.

After determining the root cause, you could modify your change generation script and rerun all tests with its new output. Or, because script modifications take time, you may want to first determine the correct change manually to close the service request in a timely manner. Batfish Enterprise lets you mix automatic and manual changes and test their combined impact.

Simplified reviewing

To facilitate change reviews, you can attach the results of all the change behavior tests, the results of network policy evaluation, and the full impact report of the change to the service request ticket. This information makes it super easy for the reviewer to approve the change. Once approved, you deploy the change to the production network with the confidence that it will work exactly as intended.

Summary

Chances are that your team spends more time on ensuring that the config change is correct than on generation and deployment. To realize the end-to-end gains from automation, you must automate not just config generation and deployment, but also testing. Batfish Enterprise enables you to build an end-to-end automation pipeline that automates testing and simplifies reviews. The result is a network that moves fast and does not break.

Check out these related resources

Demo video that shows this specific example in action

Test drive network change MOPs without a lab

2021-04-30T00:00:00+00:00

Imagine that you could predict and test the full impact of every single change to the network. Imagine also being able to do this within minutes, for the production network itself (not a small-scale replica), and without having to set up and maintain a test lab. Will this capability enable you to reduce the risk of outages and breaches? Will it enable you to be more responsive to the changing business needs of your organization?

Change Reviews in Batfish Enterprise enable you to realize these agility and resiliency benefits by letting you test drive your change MOP (method of procedure) on a twin of your production network. If you like the results, you can confidently push the change to production.

Let us consider an example change: adding a new subnet to the data center fabric. Your MOP may look like the following.

MOP for adding a new subnet to a leaf in the DC fabric

User inputs

10.250.89.0/24: The subnet prefix to add
leaf89: The leaf router where to add the subnet

Pre-change tests

Route to 10.250.89.0/24 should NOT already exist in the fabric. Log into a spine router, run “show ip route 10.250.89.0/24”, and check that the subnet prefix is not present.

Change commands

Log into leaf89 and enter the following commands after double checking that Ethernet7 is shutdown and vlan 389 is unused.

interface Ethernet7
  switchport mode access
  switchport access vlan 389
interface Vlan389
  ip address 10.250.89.1/24
  no shutdown
router bgp 65089
  address-family ipv4 unicast
     network 10.250.0.0/16

Post-change tests

Route to 10.250.89.0/24 should exist in the fabric. Log into a spine router, run “show ip route 10.250.89.0/24”, and check that the subnet prefix exists.
Route to 10.250.0.0/16 aggregate that covers the subnet should exist at the border routers and the firewall. Log into a border router, run “show ip route 10.250.0.0/16”, and check that the aggregate route exists.

Rollback commands and tests

// Skipped for this article
// Batfish Enterprise can help test the rollback procedure as well

To test drive this MOP in Batfish Enterprise, you would first specify the planned implementation. Simply select the device from the list and enter the configuration commands planned for it. When you do that, Batfish Enterprise will check that the commands are valid. In the screenshot below, for instance, it is warning that the interface address is not correctly specified.

You would next specify your tests. Batfish Enterprise test templates allow you to check all manners of network behaviors. For our example change, as shown below, you would use the Devices Have Routes template to test that the subnet route is not present before the change (a pre-change test) and is present after the change (a post-change test).

Compared to what you might do during the maintenance window, this test will check for the presence of the route on all devices, not just one or two you might log into while making the change. You would similarly specify the second post-change test to ensure that the /16 aggregate is present on the border routers and the firewall.

After entering the configuration commands and tests, you tell Batfish Enterprise to evaluate the change. If a test fails, Batfish Enterprise will show you which devices are failing and why. Below, we see that it is showing that the second test (about the aggregate) is failing, and the aggregate prefix is not present on the border routers and the firewall.

If you had applied this change during the maintenance window, you’d have to rollback the change, debug it, and then schedule it for a future maintenance window. With Batfish Enterprise, you make such discoveries ahead of time, and the change will be successfully executed in one maintenance window.

Based on the information provided by Batfish Enterprise, you’d quickly realize that the test is failing because the subnet prefix (10.250.89.0/24) is not covered by any of the existing aggregates announced by the border leafs. Past subnet additions succeeded because those prefixes were drawn from existing aggregates. The fix is easy once you make this discovery. You would add another aggregate to the list announced by the two border leafs and update the change specification within Batfish Enterprise. The screenshot below shows the change commands for bl02, one of the border leafs.

When you run validation again now, Batfish Enterprise will consider the combined impact of changes across all devices and tell you if any test is still failing.

Batfish Enterprise offers two additional ways for you to gain confidence in the change. If you have defined network-level policies—-behaviors that must always be true of the network (e.g., the corporate Website should always be accessible from the Internet)—-it will check that the change does not violate any of those.

Further, it enables you to see the full impact of the change, including changes in RIBs, FIBs, and end-to-end connectivity. The two screenshots below show that 1) two new prefixes (the subnet prefix and the aggregate) will be added to leaf devices, and 2) new flows will be allowed from the Internet to leaf89 (where the subnet was added). The connectivity between the internet and other leafs is unchanged. Using such views, you can easily verify that the change has exactly the intended impact–no more, no less.

You are now done. Within a matter of minutes, you have validated that your planned change passes your tests, does not violate your network policy, and has exactly the intended impact. You can now enter the maintenance window with confidence!

Check out these related resources:

Demo video that shows this specific change example in action.
Demo video that shows even automated changes can be fully tested before deployment.

Network test automation: Rock, Paper, Scissors, Lizard, or Fish?

2021-03-31T00:00:00+00:00

When building a network automation pipeline, one of the most important questions to consider is: How do you test network changes to prove that they will work as intended and won’t cause an outage or open a security hole? In a world without automation, this burden falls on network engineers and approval boards. But in a world where network changes are automated, testing of changes must be automated as well.

There are multiple types of testing that you should consider, such as:

Data validation (e.g., the input for an IP address is a valid IP address)
Syntax validation (e.g., configuration commands are syntactically valid)
Network behavior validation (e.g., firewall rule change will permit intended flows)

I will focus on network behavior validation in this blog. It provides the strongest form of protection by validating the end-to-end impact of changes.

How can you validate network behavior that a change will produce before the change is deployed to the production network? You have two options.

You can build a lab that emulates the production network, using physical or virtual devices, and apply and test the changes there.
You can simulate the change using models of network devices.

GNS3 is a popular emulation tool, and Batfish is a comprehensive, multi-vendor simulation tool. The title of this blog post is a play on their logos (and of course Big Bang Theory), though the discussion below applies equally to other emulation tools such as EVE-NG and VIRL.

As Batfish developers, we are frequently asked if engineers need both tools. The answer is: Both tools should be part of your testing toolkit as they are built to solve different problems. The table summarizes our view which I’ll discuss in more detail below. Before proceeding, I should add that GNS3 is an excellent tool and we use it extensively to build and test high-fidelity device models in Batfish.


Correctness guarantees	✔	✘
Configuration compliance and drift	✔	✘
High-level, vendor-neutral APIs	✔	✘
Embed in CI	✔	⚠ (Slow, resource heavy)
Analyze production network’s twin	✔	⚠ (Rarely possible)
Test new software versions and features	✘	✔
Test performance	✘	✔ (Some scenarios)
Our recommendation: Use the fish for day-to-day configuration changes, and use the lizard for qualifying new software /assets/images and lighting up new features.

Unique Batfish strengths

Batfish has three unique strengths. First, only Batfish can provide correctness guarantees that span all possible flows. For instance, when opening access to a new /24 prefix, you may want to know that no port to that destination prefix is blocked from any source, or that you have not accidentally impacted any other destination. Such guarantees are not possible in GNS3 but are almost trivial in Batfish.

Second, with Batfish you can not only test that the configuration produces the right behavior, but also that it complies with your site standards and has not drifted from its desired state. Batfish builds a vendor-neutral configuration model that you can query to validate, for instance, that the TACACS servers are correct and that the correct route map is attached to each BGP peer. This is not possible with GNS3.

Finally, while both tools will let you check network behaviors by running traceroutes and examining RIBs, only Batfish offers simple, vendor-neutral APIs. These APIs make it trivial, for instance, to check that packets take all expected paths (ECMP) and understand why a traceroute path was taken. Doing such inferences in GNS3 will take lots of careful test generation, vendor-specific parsing of RIBs and “show” data, and manual correlation.

Thus, Batfish enables comprehensive testing with strong correctness guarantees and vendor-neutral APIs. As we’ll see next, it also enables you to mimic your full production network and embed testing in your CI framework.

Overlapping capabilities but advantage Batfish

There are two testing goals for which you may consider either tool in theory, though Batfish is the pragmatic choice. First, you can invoke either tool in your CI (continuous integration) framework such that the analysis is run automatically for each change. Because of the time it takes to run GNS3 tests and the amount of resources it will consume, such usage is only a theoretical possibility for most networks. With Batfish today, many users run a wide array of tests, which finish within minutes for even networks with thousands of devices.

Second, ideally, you want the tests to be run in an environment that closely mimics your production network. Creating an exact replica of the production network is nearly impossible with GNS3. Software /assets/images may not be available for some devices; they are certainly not available for cloud gateways. Or, you may run into other limitations such as the number of allowed ports or different interface names. You will need to get around such limitations by creating a smaller network and modifying your network configs, which leaves testing gaps. Batfish can easily mimic your entire production environment, including cloud deployments.

Unique GNS3 strengths

A unique strength of GNS3 is that it can help qualify software /assets/images that you are going to run in production. It can test that the new versions of device software run in a stable manner and that features that you haven’t used before work as expected.

GNS3 can also help you test some aspects of performance. You can test that the device can process large routing tables (though not that it does not drop packets under high load, for which you need real hardware). You can also emulate different link delays and loss rates to evaluate the impact of network conditions on, for instance, on event monitoring systems.

One observation I make is that these types of tests do not need to be done frequently. They are needed only when you upgrade software or change network design to use new features. They are not needed for day-to-day changes such as adding firewall rules, updating route policies, or provisioning new racks.

Summing it up

The choice of network testing tool depends on testing goals. To flag potential bugs in vendor software, you need GNS3. To find errors in your network configuration, you need Batfish. Both testing goals are important. Thus, we recommend using the lizard to qualify software /assets/images and the fish for day-to-day network configuration changes. That way you will lower the risk of software bugs causing network outages, and you will have configuration change testing that is comprehensive and production-scale.

Acknowledgement: Thanks to Titus Cheung (HSBC Equities Regional Infrastructure Manager and Architect) for sharing his insights on where each tool applies and providing feedback on this article.

Don’t be afraid of (network) change

2021-03-10T00:00:00+00:00

Companies large and small crave agile, resilient networks. They crave infrastructure that adapts rapidly to business needs without outages or security breaches. But changing the network is a risky proposition today, be it adding a firewall rule or provisioning a new rack. 50-80% of network outages are caused by bad network configuration changes. This high level of risk forces networking teams to tread carefully (and slowly) and prevents them from automating network changes.

No wonder Dilbert does not want to be the one updating the firewall.

Network changes today center around MOPs (method of procedure) that outline the steps to implement a change, for instance:

Run pre-change test (e.g. “show” commands) to test that the network is ready for the change
Enter configuration commands that implement the change
Run post-change tests to confirm that the change worked
If something went wrong, roll back the change

MOPs are built by network architects and engineers based on their knowledge of the network and are reviewed by Change Approval Boards (CAB).

In any large network that has been operational for months, and is managed by multiple engineers, there invariably are special cases (“snowflakes”) that are all-too-easy for engineers and the CAB to miss. The inability of humans to consider all such cases is what creates the risk of change-induced outages.

For some assurance of correctness, MOPs for complex changes are tested in physical and virtual labs and the test reports are included in the review process. While labs are helpful, they rarely mimic the full production network (especially the snowflakes). It is also impossible in a lab to run all test cases needed to fully validate a change—-fully validating a firewall change to isolate two network segments must consider billions of test packets. The gap between the lab and production networks and the incompleteness of tests, means that the CAB does not have a complete view of the impact of the change. The risk of them approving an incorrect change is still high.

Today, it is almost impossible to guarantee that network changes won’t disrupt critical services or open security holes.

We built the Change Review workflow in Batfish Enterprise for provably safe network changes. This feature enables network engineers and operators to comprehensively validate MOPs. You specify the change commands and pre- and post-change tests (interactively or via API calls). Batfish Enterprise then simulates your full production network (with all of its snowflakes), predicts the full impact of the change (for all traffic), and flags any test failures.

You can now be confident that the planned change is correct and can be safely deployed to the network. You can also attach the proof-of-correctness test reports to change management tickets, making CAB reviews easier and faster.

With the new Change Review workflow in Batfish Enterprise, you can ensure that the security and availability of the network is never compromised by a configuration change.

The rigorous validation of the MOP and full visibility into the impact of the change will enable you to reduce outages and dramatically increase change velocity. These correctness guarantees are also the foundation upon which you can automate the network change workflow.

See the solution in action in these videos.

Validating the validator

2020-12-18T00:00:00+00:00

Batfish provides a unique power to its users: validate network configurations before pushing them to the network. Its analysis is production-scale—unlike with emulation, you don’t need to build a trimmed version of your network. It is also comprehensive—considers all traffic, not just a few flows. These abilities help network engineers proactively fix errors that are responsible for 50-80% of the outages.

Batfish finds these errors by modeling and predicting network behavior given its configuration. The higher the fidelity of Batfish models the better it will be at flagging errors.

So, the question is: How do we build and validate Batfish models?

As any network engineer will testify, accurately predicting network behavior based on configuration is super challenging. Device behaviors differ substantially across vendors (Cisco vs. Juniper), across platforms of the same vendor (IOS vs. IOS-XR), and sometimes even versions of the same platform. Further, it is impossible to build high-fidelity models based solely on reading RFCs or vendor docs. RFCs are silent about vendor config syntax, and vendor docs are often incomplete, ambiguous and sometimes even misleading. And don’t even get me started on how wrong the Internet is— to see what I mean, try using it to figure EIGRP metric computation.

To appreciate the need to go beyond RFCs and docs, consider the following FRR configuration.

!
ip community-list 14 permit 65001:4
ip community-list 24 permit 65002:4
!
route-map com_update permit 10
 match community 14
 on-match goto 20
 set community 65002:4 additive
!
route-map com_update permit 20
 match community 65002:4
 set community 65002:5 additive
!
route-map com_update permit 30
 match community 24
 set community 65002:6 additive
!

If a route with community 65001:4 is processed by this route map, which communities will be attached in the end?

Will 65002:5 be attached? If you answered ‘yes’, you’d be wrong. The community list referenced on line 11 (65002:4) is not defined and hence the match will not occur.
Will 65002:6 be attached? If you answered ‘no’, you’d be wrong. Line 15 will match because the community 65002:4, which was attached earlier in Line 8, matches list ‘24’.

Correctly predicting the behavior of this route map requires that you know 1) that FRR expects defined community lists in ‘match community’ statements, 2) what happens when an undefined list is mentioned, and 3) if ‘match community’ statements can match on communities attached by earlier statements in the route map or only on the original set of communities. It is not easy to glean all this information from the docs.

That is why Batfish models are guided by actual device behaviors. Benchmarking these behaviors in our labs and observing them in many real networks helps us build the models and continuously improve their fidelity.

To build models for a feature on a particular platform and version, we use three types of benchmarks to capture the device behavior in detail.

Feature-level benchmarking. We start by lighting up the device of interest in our lab. Commonly, we use virtual /assets/images but need physical devices when that is not an option. We then configure the feature of interest in various ways and observe its behavior for different inputs. For FRR route maps, for instance, we would configure the route map with defined/undefined community lists and inject routes with different communities.
Device-level benchmarking. Network configuration would be a lot simpler if features didn’t interact with each other but the reality is different. For instance, a firewall that supports both network address translation (NAT) and zone filters, one needs to fully understand if filters are applied post-NAT operations and if they operate on pre- or post-NAT headers. To account for feature interactions, we create more complex scenarios where the features are jointly configured in different ways and benchmark the device behavior in each of those scenarios.
Network-level benchmarking. We also construct larger network topologies with multiple devices (possibly of different vendors) configured in common patterns, to help validate end-to-end behaviors.

Batfish models faithfully mimic all the behaviors in our benchmarks. Model building and the benchmarking steps are not executed as a strict waterfall. Rather, we follow an iterative process, where we refine the models in successive iterations. For example, network-level benchmarking may uncover a modeling gap, for which we’d go back to Step 1 to fully understand the gap.

The most challenging part of this process is devising interesting test cases. We can do it mainly because of the experience of our engineers and help from the Batfish community members who contribute test cases and report issues.

Ensuring model fidelity is not a one-time process. It is possible that we missed a feature interaction or that the model fidelity is compromised when we extend Batfish to more platforms and features. Two activities help us guard against these risks. First, when on-boarding a new network and then periodically after that, we compare the show data (interfaces, RIBs, FIBs, etc.) from the network to what is predicted by Batish. This helps flag any feature interactions that are not modeled. This way, as Batfish encounters more networks, its fidelity keeps improving.

Second, as part of our nightly CI (continuous integration) runs, we check that the network state computed by the latest Batfish code base continues to match the show data from real networks and our benchmarking labs. This helps quickly catch unintended regressions in model fidelity.

All said and done, are Batfish models guaranteed to be perfect? No, but neither are humans tasked with checking configuration correctness today. Think of Batfish as a lethal config-error-killing force that combines the knowledge of the most knowledgeable network engineer you have met for each platform that you analyze. Even those engineers might miss an error. But unlike them, Batfish will never forget the platform idiosyncrasies it has learned over time, and it will always catch situations where your Seattle change interacts poorly with your Chicago configuration. However, Batfish will not go with you to grab a drink to celebrate the complex change you executed without a hitch.

Lesson from a network outage: Networks need automated reasoning

2020-10-28T00:00:00+00:00

In the afternoon of October 23, within a few minutes of each other, two friends sent me a link to the recently-released “June 15, 2020 T-Mobile Network Outage Report” by the Public Safety and Homeland Security Bureau (PSHB) of the FCC. Given what Intentionet does, the report sounded interesting and I started reading it immediately.

The report details the massive impact of T-Mobile’s network outage. It lasted over 12 hours, and “at least 41% of all calls that attempted to use T-Mobile’s network during the outage failed, including at least 23,621 failed calls to 911.” The report also shares some individual stories behind the staggering numbers. For example, “One commenter noted that his mother, who has dementia, could not reach him after her car would not start and her roadside-assistance provider could not call her to clarify her location; she was stranded for seven hours but eventually contacted her son via a friend’s WhatsApp.”

“The outage was initially caused by an equipment failure and then exacerbated by a network routing misconfiguration that occurred when T-Mobile introduced a new router into its network.”

As someone who is working to eliminate such outages, parts of the report that discuss its root causes were illuminating. I learned that “the outage was initially caused by an equipment failure and then exacerbated by a network routing misconfiguration that occurred when T-Mobile introduced a new router into its network.”

Reading through the sequence of events that caused the outage, I could not help but conclude that this outage was completely preventable. T-Mobile should have known that their network is vulnerable to the failure and should have also known that the configuration change was erroneous before making the change.

Anatomy of the outage

Let me explain using the same example network as the report, shown below. The network runs the Open Shortest-Path First (OSPF) routing protocol in which each link is configured with a weight and traffic uses the least weight path to the destination. The left diagram shows such a path from Seattle to Miami when all links are working, and the right diagram shows the path when the Seattle-Los Angeles link fails.

The notable thing is that the paths taken by the traffic are deterministic and knowable ahead of time, before any link weight is (re)configured or any failure occurs. For a large network, this computation tends to be too complex for humans, but computers are great for this type of thing.

For the T-Mobile outage, the first trigger was a link weight misconfiguration, which caused traffic to take unintended paths that were not suitably provisioned, thus causing an outage. This error could have been completely prevented by analyzing the impact of the planned change and ensuring that it did the right thing.

The second trigger was a link failure, which caused even more traffic to take unintended paths. This trigger was not preventable—equipment failures are a fact of life in any large network—but networks are designed to tolerate such failures. However, the fault-tolerance that T-Mobile thought existed in their network did not because of a configuration error. Whether this was due to the same misconfiguration as the first trigger or a different one is not clear from the report. In any case, the adverse impact of the failure could have been prevented by simulating failures and ensuring that the network responds correctly to failures.

Another factor behind the outage was a latent bug in the call routing software used by T-Mobile. It appears that the software had not been tested under the conditions induced by the events above.

Finally, T-Mobile’s attempts to alleviate the situation made it worse. They deactivated a link that they thought would divert the traffic to better paths but instead worsened the congestion. This also could have been prevented by analyzing the impact of link deactivation before deactivating it.

Automated reasoning to the rescue

While the outage was preventable, it would be unfair to pin blame on T-Mobile’s network engineers. Reasoning about the behavior of large networks is an incredibly complex task. Large networks have hundreds to thousands of devices, each with thousands of configuration lines. Judging the correctness of the network configuration requires network engineers to reason about the collective impact of all these configuration lines. Further, they need to do this reasoning not only for the “normal” case but also for all plausible failure cases, and there are a large number of such cases in a network like T-Mobile’s. It is simply unreasonable to expect humans, no matter how skilled, to be able to judge the correctness of configuration or predict the impact of changes.

To overcome these limits of human abilities, we must employ an approach based on software and algorithms, where the correctness of configuration and its response to failures is automatically analyzed. Fortunately, the technology to do this already exists. Tools such as Batfish, to which we at Intentionet contribute, can easily do this type of reasoning at the scale and complexity of real-world networks. As an example, see a blueprint we published last year on how failures can be automatically analyzed: Analyzing the Impact of Failures (and letting loose a Chaos Monkey).

Industry-wide change is needed

I certainly do not mean to single out T-Mobile. The problem is systemic and similar outages have happened at other companies as well. To list a few from this summer alone:

Based on my understanding of these outages, each was avoidable if effective and automatic configuration analysis using tools like Batfish were employed. In fact, studies show that 50-80% of all network outages can be attributed to configuration errors.

Modern society relies on computer networks, and this reliance has increased manifold this year as many of us have started working remotely. The networking community must respond by building robust infrastructure where outages are highly rare. We cannot continue to rely on human expertise alone to prevent outages and must start augmenting it systematically with automated reasoning. The hardware and software industries made this leap many years ago and experienced a dramatic improvement in reliability.

We have the technology and the ability, and we should have plenty of motivation. All we need is a collective will to fortify our defenses. It is time.

Batfish

Stopping Network Outages Before They Start

Incrementally automating your network

The networking test pyramid

The Software Test Pyramid

The Test Pyramid applied to Networking

Putting the Networking Test Pyramid to Practice

Configuration content

Device behavior

Device adjacencies

End-to-end

Wrap up

Closing the loop on testing network changes

The three stages of closed-loop test automation

Tools of the trade

Batfish

Suzieq

Implementing closed loop testing

Installing the Software

Pre-approval testing

Deployment pre-testing

Deployment post-testing

Summary

Automating the long pole of network changes

Criterion 1: The change should have the intended behavior

Criterion 2: The change should not violate network policy

Criterion 3: The change should not cause collateral damage

Rapid iteration

Simplified reviewing

Summary

Test drive network change MOPs without a lab

MOP for adding a new subnet to a leaf in the DC fabric

Pre-change tests

Change commands

Post-change tests

Rollback commands and tests

Network test automation: Rock, Paper, Scissors, Lizard, or Fish?

Correctness guarantees

Configuration compliance and drift

High-level, vendor-neutral APIs

Embed in CI

Analyze production network’s twin

Test new software versions and features

Test performance

Unique Batfish strengths

Overlapping capabilities but advantage Batfish

Unique GNS3 strengths

Summing it up

Don’t be afraid of (network) change

Validating the validator

Lesson from a network outage: Networks need automated reasoning

Anatomy of the outage

Automated reasoning to the rescue

Industry-wide change is needed