Saturday, October 26, 2019

Anycast Stateless Services with NSX-T, Implementation

First off, let's cover what's been built so far:
To set up an anycast vIP in NSX-T after standing up your base infrastructure (already depicted and configured), all you have to do is stand up a load balanced vIP at multiple sites. NSX-T takes care of the rest. Here's how:
Create a new load balancing pool.

Create a new load balancer:
Create a new virtual server:
If your Tier-1 gateways have the following configured, you should see a new /32 in your routing table:
Repeat the process for creating a new load balancer and virtual server on your second Tier-1 interface, pinned to a completely separate Tier-0. If multipath is enabled, you should see entries like this in your routing table:

It really is that easy. This process can be repeated for load balancers, and (when eventually supported) multisite network segments.

A few caveats:

  • State isn't carried through: if you're using a stateful service, use your routing protocols (AS-PATH is an easy one) to ensure that devices consistently forward to the same load balancer
  • Anycast isn't load balancing: This is easy here, as NSX-T can do both. This won't protect your servers from overload unless you use one.
  • Use the same server pool: It was (hopefully) apparent that I used the same pool everywhere. Try to keep regional configurations consistent, to ensure that new additions aren't missed for a pool. Server pools should be configured on a per region or per transport zone basis.
Some additional light reading on anycast implementations:

Saturday, October 19, 2019

Anycast Stateless Services with NSX-T, the Theory

Before getting started, let's cover what different IP message types exist in a brief summary, coupled with a "day in the life of a datagram" as it were.

One source, one well-defined destination. Most network traffic falls into this category.

Mayfly perspective:
Source device originates packet, and fires it to whatever route (yes, hosts, VMs and containers can have a routing table) matches based on the destination.
The destination router, if reachable, forwards the packet, and decrements the time-to-live (TTL) field by 1. Rinse and repeat until the destination is reached. Note: the TTL field is 8 bits, so if a message needs over 255 hops, it won't make it. (we're looking at YOU, Mars!) Pretty boring, but boring is good. 

One source, many specific destinations. This has a moderate gain in efficiency over bandwidth constrained links when routed.

In most cases, if a group pruning protocol, e.g. IGMP, MLD, is not running, multicast traffic "floods" and distributes all messages across all ports. The most common application for multicast is as a discovery or routing protocol.

Mayfly perspective:
Source device originates packet and the next layer 2 device replicates the packet to all multicast destinations (if IGMP/MLD is not doing its job, this becomes a flood, and forwards on all ports, which removes the forwarding efficiency) and then stops.
If multicast routing is enabled, traffic will forward just like it did with unicast, and have a moderate increase in efficiency. This is at the expense of traffic control. Since all multicast traffic is inherently stateless, there's no way to manage bandwidth consumption, fully eliminating the efficiency gain in many cases. If you're running routed multicast, I'd highly recommend using BGP to prune the multicast table... to help with some of this.

One source, ALL destinations. This is usually the least efficient traffic type and is part of why most networks don't have one all-encompassing VLAN, but instead use a number of subnetworks. With some exceptions, this traffic type is exclusively for when a source doesn't know how to get to a destination, e.g. ARP.

Mayfly Perspective:
Source device originates packet and the next layer 2 device floods on all ports but the origin (unless it's a hub). This traffic is subsequently dropped by all layer 3 forwarding devices unless a broadcast helper address is configured.

Unicast with a twist. Addresses (or networks) are advertised by multiple nodes, all capable of providing a service, enabling an end device to speak to the nearest available node.

Mayfly Perspective:
Source device originates packet and forwards on the appropriate interface leverages whatever routing metrics will choose. Next Layer 3 device will forward traffic to the available node with the most favorable routing protocol metric. 
There's a lot to unpack here. Let's focus on the main points re: Anycast:
  • It DOES forward to the nearest available node, and if configured correctly, will use less reachable nodes as a backup.
  • It DOES NOT load balance traffic in any meaningful way.
  • It DOES NOT retain state
This is a pretty big deal-breaker, but let's keep in mind that we have more tools - these incapabilities are completely achievable. The only things you need to provide to make a anycast service are:
  • A load balancer
  • A load balancer that provides stateful services, or one that will synchronize state.
  • A load balancer
NSX-T conveniently provides the above with fully integrated routing and switching (We set up BGP, the routing protocol of the internet before), and adds micro-segmentation firewalling to boot. I'll cover more of that on the next post.

Before we go much further, this is a critically important that we understand something very fundamental. 


I know it sounds dramatic, but VMWare's concept of a "transport zone" seems to imply that universal reachability via a PORTABLE SUBNET is the primary goal. In NSX-V, this was described as a Universal Distributed Logical Router (UDLR), and does not appear to be fully implemented in NSX-T. As a network designer, we should plan for universal reachability leveraging the Anycast model, e.g. "Will the nearest NSX-T Edge please stand up" wherever possible. 

Hopefully, it is clear by now, but Anycast isn't a specific IP message type, but instead a design for network reachability. It's commonly Unicast, but can be multicast if an implementation is carefully designed. The core principle for Anycast is to provide the shortest path to an asset, to the best knowledge of the network routing protocol.

More on the practical side of this post, but common Anycast applications include:
  • DNS
  • Application load balancers
  • Content Delivery Networks (CDNs)
Coming soon - how to do this with NSX-T!

Saturday, October 12, 2019

BGP Graceful Restart, some inter-platform oddities, what to do with it

Since most of NSX-T runs in a firewall mode of sorts, it's probably worthwhile to discuss on of the less well-known routing protocol features - Graceful Restart.

As published for BGP, IETF RFC 4724 outlines a mechanism for "preserving forwarding traffic during a BGP restart." This definition may be a little misleading, but that's mostly because of HOW the industry is leveraging Graceful Restart. Here are a few of the "normal use-cases" for BGP GR:

Cisco Non-Stop Forwarding and other similar technologies:
Cisco has developed another standard - NSF - that applies industry-generic methods for executing a BGP restart with forwarding continuity, with a twist. In many cases, multi-supervisor redundancy is a popular way of keeping your high-availability numbers up, with either a chassis switch running multiple supervisor modules or multiple devices bonded into a virtual chassis. In theory, these implementations get better availability numbers because they'll keep the main IP address reachable during software upgrades or system failures.
In my experience, this is great in campus applications, where client devices don't really have any routing/switching capability (like a cell phone) and where availability expectations are somewhat low (99%-99.99% uptime). However, in higher availability situations or ones running extensive routing protocol functionality, this appears to fall apart somewhat, where the caveats start to break the paradigm:

  • ISSU caveats: You have to CONSTANTLY upgrade your routers because ISSU is typically only supported across 1 or 2 minor releases. If you have a "cold" cutover, i.e. with a major version upgrade, you'll see a pretty extensive outage (5-30 minutes long depending on hardware)
  • Older implementations of a multi-supervisor chassis tend to have configuration sync issues, you need to CONSTANTLY test your failover capability (I mean, you should do that anyway...)

Just my 2 cents.  But here's where Graceful Restart does its job: During a supervisor failover, the IP address of the routing protocol speaker is shared between supervisors, so when establishing a routing protocol adjacency, the speakers negotiate GR capability, along with tunable timers. Since the IP doesn't change, the greatest availability action would be to continue forwarding to a "dead" address until the adjacency is established, ensuring sub-second availability for a dynamic routing protocol speaker (except in the case of updating your gear...)
Most firewall implementations are either Active-Active or Active-Standby, with shared IP addresses and session state tables. Well-designed firewall platforms use a generic method for sharing the state table, which includes (ideally) the session table, routing table, etc. ensuring that mismatched software versions do not introduce a disproportionate outage. The primary downside to this approach is that you don't have a good way to test your forwarding path (beyond Layer 2) so you should TEST OFTEN.

Now let's cover where you should NOT use Graceful Restart:
Any situation where the routing protocol speaker does not have a backup supervisor or any state mechanism. Easy, right?

NOPE. You have to enable Graceful Restart on speakers that have an adjacent firewall (or NSX-T Tier-0 gateway) to support the downstream failover.

RFC 4724 outlines two modes for Graceful Restart: Capable and Aware. Intuitively, GR Capable speakers should be stateful network devices, such as multi-supervisor chassis, firewalls, or NSX-T edges, and GR Aware devices should be stateless network devices, such as layer 3 switches.
The catch, however, is that not all devices support GR Awareness mode. For example, it IS supported in IOS 12, but provides caveats on what hardware has this capability.

So why does this matter? Well, Cisco illustrated it well in this NANOG presentation by stating that if an NSF-Capable advertising device fails, but there is no backup device sharing that same IP address, all traffic is dropped until the GR timers expire. Ouch. This is especially bad given some defaults:

  • RFC 8538 Recommendation: 180 seconds
  • Palo Alto: 120 seconds
  • Cisco: 240-300 seconds
  • VMWare NSX-T: 600 seconds?!?!?!?

Now that's pretty weird. If we fetch from VMWare's VVD 5.0.1, it says the following:
NSXT-VISDN-038 Do not enable Graceful Restart between BGP neighbors. Avoids loss of traffic. Graceful Restart maintains the forwarding table which in turn will forward packets to a down neighbor even after the BGP timers have expired causing loss of traffic. 
Coupled with the recommendation for Tier-0 to be active-active (remember, as I stated before, stateless devices do NOT need GR):

Oddly, it did not warn me about needing to restart the session. Let's find out why:

bgp-rrc-l0#show ip bgp summary
BGP router identifier, local AS number 65000
BGP table version is 84, main routing table version 84
7 network entries using 819 bytes of memory
11 path entries using 572 bytes of memory
14/6 BGP path/bestpath attribute entries using 1960 bytes of memory
2 BGP AS-PATH entries using 48 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 3399 total bytes of memory
BGP activity 102/93 prefixes, 264/247 paths, scan interval 60 secs

Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd      4 65000  143031  142962       84    0    0 14w1d           2      4 65000  143036  142962       84    0    0 14w1d           1       4 64900  330104  280526       84    0    0 1d17h           1      4 65001  178250  174230       84    0    0 1w0d            3
FD00:6::240     4 65000  310833  578924       84    0    0 14w1d           0
FD00:6::241     4 65000  301493  578924       84    0    0 14w1d           1

Note that for GR to be modified, the BGP session must re-start, so if this was a production environment with equipment that supports GR (*sigh*) you would want to get into the leaf switch and perform a hard restart of the BGP peering.

VMWare's VVD recommendation here is pretty sound, as with most devices the GR checkbox is a global one, so you'd want to buffer between GR/Non-GR with a dedicated router (it's just a VM in NSX's case!), keeping in mind most leaf switches will have GR enabled by default.

Oddly enough, Cisco's Nexus 9000 platform (flagship datacenter switches) default to graceful restart capable. My recommendations (to pile on with the VVD) on this platform would be to:

  • Set BGP timers to 4/12
  • Set GR timers to 120/120 or lower (they're fast switches, so I chose 30/30)
  • Under BGP, configure graceful-restart-helper to make the device GR-Aware instead of GR-Capable
Obviously, the VVD will adequately protect your infrastructure to issues like this, but I think it's unlikely you'll have NSX-T as the only firewall in your entire datacenter.

Saturday, October 5, 2019

NSX-T 2.5 Getting Started, Part 2 - Service Configuration!

Now that the primary infrastructure components for NSX-T are in place, it is now possible to build-out the actual functions that NSX-T is designed to provide.

A friendly suggestion, make sure your Fabric is healthy before doing this:
NSX-T differs from NSX-V quite a bit here. Irregular topologies between edge routers aren't supported, and you have to design any virtual network deployments in a two-tier topology that somewhat resembles Cisco's Aggregation-Access model, but in REVERSE.

The top tier of this network, or as VMWare calls it in their design guide, Tier-0, the primary function provided by logical routers in this layer are simply route aggregation devices, performing tasks such as:
  • Firewalling
  • Dynamic Routing to Physical Network
  • Route Summarization
  • ECMP
The second logical tier, Tier-1 is automatically and dynamically connected to Tier-0 routers via /31s generated from a prefix of your choosing. This logical router will experience a much higher frequency of change, performing tasks like:
  • Layer 2 segment termination/default gateway
  • Load Balancing
  • Firewalling
  • VPN Termination
  • NAT
  • Policy-Based Forwarding
Before implementing said network design, I prefer to write out a network diagram.

Let's start with configuring the Tier-0 gateway:
We'll configure the Tier-0 router to redistribute pretty much everything:
Configure the uplink interface:
Oddly enough, we have spotted a new addition with 2.5 in the wild - the automatic inclusion of prefix-lists!
We also want to configure route summarization, as the switches in my lab are pretty ancient (WS-3560-24TS-E). I'd recommend doing this anyway in production, as it will limit the impact of widespread changes. To pull that off, you *should* reserve the following prefixes, even if they seem excessive:
  • A /16 for Virtual Network Services per transport zone
  • A /16 for NSX-T Internals, allocating /19s to each tier-0 cluster, as outlined in our diagram.
I did so below, and it makes route aggregation or summarization EASY.
Now, we configure BGP Neighbors:
At this point, we want to save and test the configuration. It'll take a while for NSX-T to provision the services listed here, but once it's up, you'll see:
Check for advertised routes. Only routes that exist are aggregated, so you should only see

As a downside, I have prefix-filtering to prevent my lab from stomping on the vital pinterest and netfix network, so I had to add the new prefixes to that:
That was quite a journey! Fortunately, Tier-1 gateway configuration is MUCH simpler, initially. Most of the work performed on a Tier-1 Gateway is Day 1/Day 2, where you add/remove network entities as you need them:
Let's add a segment to test advertisements. I STRONGLY RECOMMEND WRITING A NAMING CONVENTION HERE. This is one big difference between NSX-V and NSX-T, where you don't have this massive UUID in the port group obfuscating what you have. Name this something obvious and readable, your future self will thank you.
Hey look, new routes!

As I previously mentioned, these segments, once provisioned, are just available as port-groups for consumption by other VMs on any NSX prepared host:
Next, we'll configure NSX-T to make waffles!

Popular Posts