Batfish Behaviour TL;DR - Impact Analysis and Named References

Today I want to share with you some behaviour that I learnt the other day that you might find helpful when working with Batfish.

This behaviour is based upon the following:

When working with impact analysis and layer1 topologies in Batfish, don't use named references within your reachability questions.

Before I begin, if you do not know what Batfish is, here is a TL;DR:

Batfish models your network and allows you to verify your network changes prior to deploying them to the network. For example, you can simulate a node and/or link failure and confirm how the network will behave (read more on use cases/features here).

Breakdown

Okay, why should we not use named references with layer 1 edges and impact analysis? Let's step through why...

When performing an impact analysis in Batfish, we typically perform something like this:

# Fork the base snapshot (a snapshot is a collection of network configs).
# In the fork we apply the failure (e.g. deactivating a node).
bf_fork_snapshot(
    "base_snapshot",
    name="failure-snapshot-access2",
    deactivate_nodes=["ios-access2"],
    overwrite=True,
)

We then perform a query to check for any flow differences between the base and the failure snapshot. Like so:

bfq.differentialReachability(
    headers=HeaderConstraints(dstIps="server1, server2, server3")
).answer(
    reference_snapshot="base_snapshot",
    snapshot="failure-snapshot-access2",
).frame()

You can see that we have supplied the dstIps as named references. However, in our case, we have also supplied a layer1 topology supplement to our snapshot to tell Batfish about the layout of the layer1 connections. This is typically used for modelling layer 2 domains within the network, as Batfish, by default, only uses layer 3 addressing to model the topology (i.e. what's connected to what).
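For reference, the layer1 topology supplement is a JSON file of edges that sits at batfish/layer1_topology.json inside the snapshot directory. Below is a minimal sketch of generating one. The helper names, the interface names and the temporary snapshot directory are all assumptions for illustration, not taken from my lab.

```python
import json
import tempfile
from pathlib import Path


def layer1_edge(host_a, intf_a, host_b, intf_b):
    """Build one edge in the layer1_topology.json format."""
    return {
        "node1": {"hostname": host_a, "interfaceName": intf_a},
        "node2": {"hostname": host_b, "interfaceName": intf_b},
    }


def write_layer1_topology(snapshot_dir, edges):
    """Write the edges to <snapshot_dir>/batfish/layer1_topology.json."""
    path = Path(snapshot_dir) / "batfish" / "layer1_topology.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"edges": edges}, indent=2))
    return path


# Interface names are hypothetical; use the ones from your own configs.
edges = [layer1_edge("ios-access2", "GigabitEthernet0/1", "server2", "eth0")]

# A temporary directory stands in for a real snapshot directory here.
topology_path = write_layer1_topology(tempfile.mkdtemp(), edges)
print(topology_path.read_text())
```

Generating the file from a source of truth (rather than writing it by hand) tends to keep the edges in step with the configs in the snapshot.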

Here lies the issue. At the point the node within the snapshot is deactivated, the interface connected to it (in our case, on server2) will also be marked as inactive, due to the layer1 supplement. Because it is marked as inactive, the named reference will not be populated. Therefore the question will not return any flows for server2, even if there are differences between the base and failure snapshot.
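To make the failure mode concrete, here is a toy model in plain Python. It is not Batfish code and does not reflect how Batfish is implemented; it only mimics the behaviour described above, where a name resolves to an IP only while its interface is active. The pairing of server1 and server3 with their IPs is assumed.

```python
def resolve_names(names, interfaces):
    """Return IPs for the given names, skipping interfaces marked inactive."""
    return [interfaces[n]["ip"] for n in names if interfaces[n]["active"]]


# Toy stand-in for the base snapshot (IP pairings for server1/server3 assumed).
base = {
    "server1": {"ip": "10.2.10.1", "active": True},
    "server2": {"ip": "10.2.20.1", "active": True},
    "server3": {"ip": "10.2.30.1", "active": True},
}

# Toy stand-in for the fork: failing the access switch also takes down
# server2's interface, because the two are joined by a layer1 edge.
failure = {name: dict(attrs) for name, attrs in base.items()}
failure["server2"]["active"] = False

names = ["server1", "server2", "server3"]
print(resolve_names(names, base))
print(resolve_names(names, failure))
```

In the toy failure snapshot server2 drops out of the resolved list, so nothing destined for 10.2.20.1 is ever compared.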

On a side note, this behaviour came about due to the enhancements in the recent release v2021.11.04:

Batfish now takes Layer-1 information into account when performing failure analysis, and Layer-1 modeling (when users provide layer1_topology.json) is now faster, more accurate, and better documented. For example, if an Ethernet interface is down and that interface has a Layer-1 edge, then its paired physical device will also be taken down.

The Fix

The fix is pretty straightforward. We just use the IP instead of the name within any of the inputs to our question, like so:

reach_diff = bfq.differentialReachability(
    headers=HeaderConstraints(dstIps="10.2.10.1, 10.2.20.1, 10.2.30.1")
).answer(
    reference_snapshot="base_snapshot",
    snapshot="failure-snapshot-access2",
).frame()

# Flow differences
>>> reach_diff.Flow
0     start=nxos-aggr1 [10.1.1.2->10.2.20.1 ICMP length=512]
1     start=nxos-aggr1 interface=Ethernet1/4 [10.1.1.3->10.2.20.1 ICMP length=512]
2     start=nxos-aggr1 interface=Ethernet1/5 [10.2.2.3->10.2.20.1 ICMP length=512]
3     start=nxos-aggr1 interface=Vlan10 [10.2.10.2->10.2.20.1 ICMP length=512]
4     start=nxos-aggr1 interface=Vlan20 [10.2.20.2->10.2.20.1 ICMP length=512]
5     start=nxos-aggr1 interface=Vlan30 [10.2.30.2->10.2.20.1 ICMP length=512]
...
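If you would rather keep the friendly names in your own scripts, one option is to translate them to IPs yourself before building the question. The sketch below assumes a hand-maintained name-to-IP map; SERVER_IPS and build_dst_ips are hypothetical names, not part of pybatfish.

```python
import ipaddress

# Hypothetical name-to-IP map. 10.2.20.1 for server2 matches the flows above;
# the other two pairings are assumed from the order of the two lists.
SERVER_IPS = {
    "server1": "10.2.10.1",
    "server2": "10.2.20.1",
    "server3": "10.2.30.1",
}


def build_dst_ips(names, ip_map):
    """Return a comma-separated IP string to pass as dstIps."""
    ips = []
    for name in names:
        ip = ip_map[name]  # KeyError flags a name missing from the map
        ipaddress.ip_address(ip)  # ValueError flags a malformed address
        ips.append(ip)
    return ", ".join(ips)


dst_ips = build_dst_ips(["server1", "server2", "server3"], SERVER_IPS)
print(dst_ips)
```

The resulting string can then be passed as HeaderConstraints(dstIps=dst_ips), so the question no longer depends on the state of any interface in the failure snapshot.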