Why wait? Eventual Consistency and Reliability

Jul 16, 2023 · 3 min read · Network Automation Programmability Design Patterns FOSS/Linux Unearned Uptime ·

Patience is tough when deploying automated code; here's why it is important

Reliability-centric infrastructure engineers need to focus on careful, procedural, validated workflows; the systems we're responsible for are simply too important to casually "toss" infrastructure requests at a common API gateway.

We can't really avoid automation either. Here's an example workflow:

<a href="simple-01.svg">Create a BGP Peer</a>

Easy, right? There are a few issues with simply coding against this workflow:

We should define a standard format for BGP Peers to make the process re-usable
We should define what information we should send to the BGP Peer
We should define what we expect the change to be
We should test to ensure the expected change occurred (ideally before rolling out to production)
We should revert the change if it doesn't produce the exact result we want
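The checklist above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `BgpPeer`, its field names, and `FakeRouter` are hypothetical stand-ins (an in-memory dictionary instead of a real device API), not any vendor's interface.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BgpPeer:
    """Hypothetical standard format for a BGP peer (field names are assumptions)."""
    neighbor_ip: str
    remote_as: int
    description: str = ""


class FakeRouter:
    """In-memory stand-in for a device API; a real client would replace this."""

    def __init__(self):
        self.peers = {}

    def get_peers(self):
        return dict(self.peers)

    def set_peer(self, peer):
        self.peers[peer.neighbor_ip] = asdict(peer)

    def restore(self, snapshot):
        self.peers = dict(snapshot)


def create_peer_safely(router, peer):
    """Apply the change, compare it with the expected state, revert on any mismatch."""
    before = router.get_peers()
    # Define what we expect the change to be before making it.
    expected = {**before, peer.neighbor_ip: asdict(peer)}

    router.set_peer(peer)

    # Test that the expected change (and nothing else) occurred.
    if router.get_peers() != expected:
        router.restore(before)  # revert to the pre-change snapshot
        return False
    return True
```

Calling `create_peer_safely(FakeRouter(), BgpPeer("192.0.2.1", 65001))` returns `True` and leaves the peer in place; if the device ends up in any state other than the expected one, the function restores the snapshot and returns `False`.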

Infrastructure-as-Code

Now - the modified workflow may seem complex and require some level of acclimation. Computers don't mind processing thousands of rows of data, or even millions - humans are a little more error-prone at that scale. Let's offload some of those tasks to the computer like this:

<a href="bgp-peer-01.svg">Create a BGP Peer Safely</a>

Implementing Infrastructure-as-Code in this case achieves several benefits at once:

Engineers end up with a "spec" defining a router or device without having to compile it from the configuration
New implementations of a given standard can expose "build against the spec" interfaces to make revising infrastructure trivial
The implementation process for a given change can be standardized across engineers and continually improved upon
Everything is carefully logged by the CI tool automatically
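One way to picture "building against the spec" is a small planning function that compares the desired state with what is running. The sketch below assumes both are plain dictionaries keyed by neighbor address; the function name and the data shape are illustrative choices, not part of any particular tool.

```python
def plan_changes(spec, running):
    """Return what must be added, removed, or changed so running matches spec."""
    to_add = {k: v for k, v in spec.items() if k not in running}
    to_remove = sorted(k for k in running if k not in spec)
    to_change = {
        k: v for k, v in spec.items() if k in running and running[k] != v
    }
    return {"add": to_add, "remove": to_remove, "change": to_change}


# Hypothetical desired state (the "spec") and hypothetical device state.
spec = {
    "192.0.2.1": {"remote_as": 65001},
    "192.0.2.2": {"remote_as": 65002},
}
running = {
    "192.0.2.2": {"remote_as": 65009},
    "192.0.2.3": {"remote_as": 65003},
}

plan = plan_changes(spec, running)
```

Because the plan is plain data, it can be reviewed in a pull request and logged by the CI tool before anything touches a device.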

The Downsides

For the purposes of this post, let's ignore the debate about abstraction and obfuscation, and examine why eventual consistency matters to achieve this goal.

Infrastructure Engineering

Infrastructure engineers have wildly different values from a typical developer. In a nutshell:

1. "Measure twice, cut once" doesn't go far enough as a principle (twice isn't enough checks)
2. Slow is safe, and safe is fast
3. There are permissible and impermissible times to perform infrastructure work
   - This imposes time limits on work, which violates principles #1 and #2

We have some problems here. Company leaders want to minimize downtime and enforce aggressive maturation cycles. Once the gear stops falling apart, the biggest danger to availability quickly becomes the infrastructure engineers themselves. This leads to shorter maintenance windows, which leads to rushed work, which then leads to more pain.

I'd like to propose a different workflow.

Network Assurance

Let's try a new workflow:

<a href="change-01.svg">Change Management Process</a>

In this world, we shift focus from the pressures of change execution to the change itself. The procedure itself should exist as code (and ideally be automated); we want to leverage a common concept in the trades: cognitive loading.

Picture your mind as a physical workspace. All people are less efficient with a cluttered workspace. Here, the clutter isn't the past year's unfinished projects; the cognitive load comes from stressors within the environment:

Am I going too fast?
Am I missing anything?
Do I have enough time to finish?
What if I missed something?

Early in the IT industry's maturation cycle, IT leadership pushed for the implementation of Standard Operating Procedures to act as a guide while executing a change, dramatically improving reliability outcomes.

Complexity is factorial in nature, and our human brains (a mental workspace) do handle this problem well, up to a point. Once we overwhelm our engineers, that's when mistakes happen - we need to leverage our computers to help with that. This is why we implement the procedure itself as code - the engineers construct the programmatic instructions themselves and continually improve on them with source control and peer review tooling (pull requests).
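A procedure written as code can also do the waiting for us. The sketch below is a minimal illustration, assuming each step is a hypothetical `(name, action, check)` tuple: the action makes the change, and the check is polled until the system converges or a timeout expires. The timeout and interval values are placeholders rather than recommendations.

```python
import time


def wait_until(check, timeout=30.0, interval=1.0, sleep=time.sleep):
    """Poll check() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


def run_procedure(steps, timeout=30.0, interval=1.0):
    """Run (name, action, check) steps in order.

    Returns (completed_step_names, failed_step_name); the second value is
    None when every step converged within the timeout.
    """
    completed = []
    for name, action, check in steps:
        action()
        if not wait_until(check, timeout=timeout, interval=interval):
            return completed, name
        completed.append(name)
    return completed, None
```

The engineer's attention goes into choosing the steps and the checks; the computer handles the patience. A step that never converges stops the run and names itself, which is the point where a revert or a human decision belongs.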

Demanding that we do things fast detracts from this; engineers should focus on the procedure and sequence of events when planning changes. This shifts their mental workspace to focus on delivering reliability.

It's not about the code; it's just another example of using a computer to better engineer solutions.