Unearned Uptime: Letting Old Ideas Go

Mar 13, 2021 · 5 min read · Datacenter Networking Routing & Switching Studies Design Patterns Unearned Uptime ·

We don't always earn reliability with the systems we deploy, design, and maintain

Infrastructure reliability is a pretty prickly subject for the community - we as engineers and designers tend to anthropomorphize, attach, and associate personal convictions with what we maintain. It's a natural pattern, but it inflicts a certain level of self-harm when we fail to improve upon the platforms that serve as the backbone to those we support.

There are two major problems I perceive with regards to translating unearned uptime to reliability

History
Ego
Architecture (later post)

Throughout this article, I'll cover these problems and then transition into common examples of "unearned uptime" in the industry. These are not "networking" issues - it's an infrastructure issue. We have the same problems with most civil structures, interchanges, runways, etc.

The idea that we didn't earn reliability delivered to the business is one thing that we as infrastructure engineers and designers aren't particularly comfortable with.

History

It doesn't have a problem! It's been working fine for years!

Credit: Marc Olivier-Jodoin

Infrastructure needs routine replacement to function correctly

Consumers rarely notice issues with infrastructure until they've gotten to be truly problematic. An easy example of this is asphalt concrete (or bitumen, depending on where you live).

The material itself is relatively simple, rock aggregate + oil - but it's pretty magical in terms of usefulness. Asphalt itself functions as a temporary adhesive, bonding to automotive tires and making roads really safe by shortening stopping distances. The composite material is also flexible, allowing the ground below it to shift to an extent - which means that places with more dynamic geology.

We don't really think about wear to this surface as consumers after it's been installed. Public works / Civil Engineers sure do, because it's their job, but think about it - if you drive your car over a residential street three times a day, that's probably over 4 metric tons of material that the road has to withstand in a day. This wear adds up! A typical residential (neighborhood) street will see over 15,000 metric tons of weight per year.

The sheer scale of road wear is utterly staggering. This GAO Report on Weight Enforcement illustrates how controlling wear (usage) is a method of conveying importance, but that doesn't really work all that well for us...

Practical IT Applications

When designing technology infrastructure, especially as a service provider, you want to encourage usage.

Usage drives bigger budgets and your salary! Ultimately, wear with tech infrastructure is going to be about the same regardless of load. Scarcity economics don't work particularly well in IT.

To solve the history problem, you want to convince business line owners to desire and delight in what you provide.

The antithesis to "customer delight" in this case is often this big guy: By User:MrChrome, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=33206669

Fun fact, the Cisco 6500 is a lot older than you'd think, entering service in 1999. For more: https://en.wikipedia.org/wiki/Catalyst_6500

Cisco 6500 series switches were simply too reliable. The Toyota Camry of switches, Cisco's 6500s lived everywhere, convincing executives that it was totally okay to skip infrastructure refreshes, much to the chagrin of Infrastructure Managers worldwide.

The Solution - Messaging

We shouldn't be waiting for stuff to fail to replace it - it's time to get uncomfortable and speak to consumers. Most humans are intelligent - let's help them understand why we care about 25/100 Gigabit connectivity, cut-through switching, 802.11ax in terms that are geared towards them.

Here are some pointers on where to start:

You're not replacing something because it was bad.
- A pretty easy pitfall for IT professionals - if you devalue "what came before" you devalue the role a replacement fills. It may be hard to do, but most things here were built for a reason - the intent behind the design is important for other reasons, but this negativity will affect anything you do after that.
Show how they can use it
- This might not make a lot of sense at the outset, but any trivial method for interaction will make a particular change feel more concrete. Some examples:
  - Add a Looking Glass view if it's a new network. Providing users a way to "peek inside" is a time-honored tradition with many industries.
  - Open some iPerf/Spirent servers for users to interact with, or other benchmarking
  - Functional demos like blocking internetbadguys.com
Share how it is made
- You never know, why not try?

Ego

This one's a bit harder - and I'm not trying to apply major negative connotations here. As engineers, we get pretty attached to our decisions, attributing significant personal effort to the products we purchase.

As an industry, IT professionals really need to re-align here. We consider vendor relationships allegiances and fundamentally attribute our own personal integrity. If I had my way, I'd stop hearing that someone's a "Cisco" or a "VMware" guy - we need to shift this focus back to consumers.

The biggest point for improvement here is also on the negativity front. Let's start by shifting from "this solution is bad" (devaluing your own work for no reason) to "This solution doesn't fit our needs, and this is why." The latter helps improve future results by getting the ball rolling on what criteria consumers value more.

After deploying quite a few solutions "cradle-to-grave," my personal approach here is to think of them like old cars, computers, stuff like that. I fondly remember riding around in my parents' 80's suburban, but we replaced it because it wasn't reliable enough for the weather we had to face in Rural Alaska, and it was too big.

Here are some examples of how I regard these older, later replaced solutions/products:

Cisco 6500s: Fantastically reliable, fantastic power bills, fantastic complexity to administer
Aruba 1xx series Access Points: Revolutionary access control, less than stellar radio performance
Palo Alto 2000/4000 series firewalls: Again, revolutionary approaches to network security, but not enough performance for modern businesses to function. Commit times improved greatly on later generations
TM-OS 11.x: Incredible documentation, incredible feature depth. If it's more modern than 2015, you're going to want more features

All of these served businesses well, then needed to be replaced. I see too many engineers beat themselves up when services eventually fell apart, and it's just not necessary.