Sunday, May 22, 2022

Scale datacenters past the number of VLAN IDs with NSX-T Tier-0 and Q-in-X

VMware introduced the ability to double-encapsulate Layer 2 frames via the "Access VLAN" option for VRF instances in NSX Data Center:

Q-in-VNI provides a capable infrastructure engineer the ability to construct straightforward multi-tenant constructs. From the documentation and previous testing, we have demonstrated its capability outside of Layer 3 constructs. The objective of this post is to examine and test these capabilities with Tier-0 VRFs:

NSX Data Center provides the ability to pass a tag inside of a segment, which enables a few interesting design patterns:

  • Layer 3 VPN to customer's campus, with each 802.1q tag delineating a separate "tenant", e.g. PCI/Non-PCI
  • Inserting carrier workloads selectively to specific networks
  • Customer empowerment - let's enable the customer to use their cloud how they please
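The scale gap in the title is easy to quantify. A quick back-of-the-envelope sketch, using the standard 802.1q and GENEVE field sizes:

```python
# 802.1q VLAN IDs vs. GENEVE VNIs: the address-space gap NSX closes.
VLAN_ID_BITS = 12   # 802.1q VID field
VNI_BITS = 24       # GENEVE (and VXLAN) VNI field

usable_vlans = 2**VLAN_ID_BITS - 2   # VIDs 0 and 4095 are reserved
vni_space = 2**VNI_BITS

print(usable_vlans)               # 4094
print(vni_space)                  # 16777216
# With Q-in-VNI, each overlay segment can carry its own 4094 inner tags:
print(usable_vlans * vni_space)   # 68685922304
```

The point isn't the last number itself - it's that the inner tag space resets per segment, so tenants can reuse VLAN IDs freely.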

To validate this hypothesis, we will leverage the following isolated topology:

Note: VRF-Lite is required for this feature!

Q-in-VNI on NSX-T Routers

When configuring an interface on a VRF, the following option (Access VLAN ID) becomes available. Select the appropriate "inside" VLAN for each sub-interface:

We then configure the sub-interfaces - the tenant VM is unaware that it's being wrapped into an overlay:

Unsurprisingly, this feature just works. NSX-T is designed to provide a multi-tenant cloud-like environment, and VLAN caps are a huge problem in that space. In this example, we created 2 subinterfaces in the same VRF - normally tenants would not share a VLAN.

Q-in-VNI Design Patterns

Offering Q-in-VNI on a Tier-0 solves valuable use cases for multi-tenant platform services. The primary focus of these solutions is customer empowerment - VMware isn't taking sides on matters of "vi vs emacs", "Juniper vs Cisco", etc. Instead, we as CSPs can provide a few design patterns that enable a customer to leverage their own chosen methods, or even allow an ISP to integrate crisply and effectively with their telecommunications services.

NSX-T has fairly small scalability limits for CSPs leveraging the default recommended design pattern (160 standalone Tier-0s), and the ultimate best solution is to leverage multiple NSX Data Center instances to accommodate. If the desired number of tenants is above, say, twice that, the VRF-Lite feature allows an infrastructure engineer to deploy 100 routing tables per Tier-0.

VRF-Lite enables scaling to 4,000 Tier-1 gateways at this level, with a highly theoretical maximum of 160,000 - but the primary advantage of this approach is that customers can bring their own networking easily and smoothly, front-ending NSX components with their preferred network OS. Customers and infrastructure engineers extend the feature set and reduce strain on NSX at the same time, creating a scenario where both the customer and the infrastructure benefit cooperatively.
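The scaling figures quoted in this post multiply out as follows (simple arithmetic on the post's own numbers - check the ConfigMax tool for your NSX-T version):

```python
# Figures quoted in this post; verify against VMware ConfigMax for your release.
standalone_tier0s = 160   # default recommended design pattern
vrfs_per_tier0 = 100      # VRF-Lite routing tables per Tier-0
tier1_per_instance = 4000 # Tier-1 gateways per NSX instance

tenant_routing_tables = standalone_tier0s * vrfs_per_tier0
print(tenant_routing_tables)   # 16000 routing tables per NSX instance
```

Even before resorting to multiple NSX instances, VRF-Lite multiplies tenant routing capacity by two orders of magnitude.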

Note: VMware's current configuration maximums are provided here:

VRF-Lite can also be built to provide a solution where customers can "hair-pin" their tenant routing tables to a virtual firewall over the same VN-Segment. Enterprise teams leveraging NSX Data Center benefit the most from this approach, because common virtual firewall deployments are limited by the number of interfaces available on a VM. This design pattern empowers customers by permitting infrastructure engineers to construct thousands of macrosegmentation zones if desired.

Q-in-Q on NSX-T Routers

Time to test out the more complex option!

When I attempt to configure an internal tag with VRF-Lite subinterfaces, the following error is displayed:

Sadly, it appears that Q-in-Q is not supported yet, only Q-in-VNI. Perhaps this feature will be provided at a later date.

Here's the VyOS configuration to perform Q-in-Q:


  • Learn, hypothesize, test is an important cycle for learning and design, and this is why we build home labs. NSX Data Center appeared to support Q-in-Q tagging - but the feature was ultimately for passing a trunk directly to a specific VLAN ID in a port-group.
  • vSphere vDS does not appear to allow Q-in-Q to trunk outwards to other port-groups that do not support VLAN trunking, either.
  • Make sure that the MTU can hold the inner and outer headers without loss. I set the MTU to 1700 for comfort, but a single 802.1q tag only adds 4 bytes of overhead (8 bytes for a double tag).
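The MTU point above can be made concrete. A small sketch using the standard header sizes (GENEVE options, if present, add more):

```python
# Minimum underlay MTU to carry a full 1500-byte payload inside a
# Q-in-VNI segment. The encapsulated frame's own Ethernet header and
# inner 802.1q tag(s) ride inside the GENEVE payload.
INNER_ETH = 14    # encapsulated frame's Ethernet header
INNER_DOT1Q = 4   # one 802.1q tag is 4 bytes
GENEVE_BASE = 8   # GENEVE base header (options add more)
UDP = 8
IPV4 = 20

def required_underlay_mtu(payload=1500, inner_tags=1, geneve_options=0):
    """IP MTU the transport network must support for this encapsulation."""
    return (payload + INNER_ETH + inner_tags * INNER_DOT1Q
            + GENEVE_BASE + geneve_options + UDP + IPV4)

print(required_underlay_mtu())              # 1554
print(required_underlay_mtu(inner_tags=2))  # 1558
```

So 1600 is already comfortable; 1700 leaves room for GENEVE options and future double-tagging.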

Friday, May 6, 2022

Different Methods to carry 802.1q tags with VMware vDS and NSX-T

VMware's vDS is a bit of a misnomer

In a previous post, I covered the concept of transitivity in networking - but in Layer 2 (Ethernet) land, transitivity is critically important to understanding how VMware's Virtual Distributed Switch (vDS) works.

The statement "VMware's Virtual Distributed Switch is not a switch" seems controversial, but let's take a moment to reflect - when you plug in the second uplink on an ESXi host, does the ESXi host participate in spanning tree?

Testing this concept at a basic level is straightforward. If the vDS were actually a switch, enabling BPDU Guard on an ESXi host-facing port would take the host down immediately (it doesn't). This concept is actually quite useful to a capable infrastructure engineer.

A "Layer 2 Proxy"

VMware's vDS is quite a bit more useful than a simple Layer 2 transitive network device - each ESXi host accepts data from a virtual machine, and then leverages a "host proxy switch" to take each packet and re-write its Layer 2 header in a 3-stage process:

Note: For a more detailed explanation of VMware's vDS architecture and how it's implemented, VMware's documentation is here

Note: VMware's naming for network interfaces can be a little confusing, here's a cheat sheet:

  • vnic: A workload's network adapter
  • vmnic: A hypervisor's uplink
  • vmknic: A hypervisor's Layer 3 adapter

A common misinterpretation of vDS is that the VLAN ID assigned to a virtual machine is some form of stored variable in vSphere - it isn't. vDS was designed with applying network policy in mind - and an 802.1q tag is simply another policy.

vDS is designed with tenancy considerations, so traffic will not transit between different port-groups even when they share the same VLAN ID. Non-transitive behaviors achieve two goals at the same time - providing an infrastructure engineer total control of data egress on a vSphere host, and adequate segmentation to build a multi-tenant VMware Cloud.

Replacing the Layer 2 header on workload packets is extremely powerful - vDS essentially empowers an infrastructure engineer to write policy and change packet behavior. Here are some examples:

  • For a VM's attached vnic, apply an 802.1q tag (or don't!)
  • For a VM's attached vnic, limit traffic to 10 Megabits/s
  • For a VM's attached vnic, attach a DSCP tag
  • For a VM's attached vnic, deny promiscuous mode/MAC spoofing
  • For a VM's attached vnic, prefer specific vmnics
  • For a VM's attached vnic, export IPFix

NSX expands on this capability quite a bit by adding overlay network functions:

  • For a VM's attached vnic, publish the MAC to the global controller table (if it isn't already there) and send the data over a GENEVE or VXLAN tunnel
  • For a VM's attached vnic, only allow speakers with valid ARP and MAC entries (validated via VMware tools or Trust-on-First-Use) to speak on a given segment
  • For a VM's attached vnic, send traffic to the appropriate distributed or service router

NSX also enables two capabilities that are incredibly useful for NFV: service chaining and Q-in-VNI encapsulation.

Q-in-VNI encapsulation is pretty neat - it allows an "inside" Virtual Network Function (VNF) to have total autonomy with inner 802.1q tags, empowering an infrastructure engineer to create a topology (with segments) and deliver complete control to the consumer of that app. Here's an example packet running inside a Q-in-VNI enabled segment (howto is here).
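To make the inner-tag mechanics concrete, here's a hand-rolled sketch of a double-tagged Ethernet header - `dot1q_tag` is a hypothetical helper for illustration, not an NSX or vDS API:

```python
import struct

def dot1q_tag(tpid, pcp, vlan_id):
    """Build one 4-byte 802.1q tag: TPID + TCI (PCP, DEI=0, VID)."""
    tci = (pcp << 13) | (vlan_id & 0x0FFF)
    return struct.pack("!HH", tpid, tci)

dst = bytes.fromhex("ffffffffffff")   # broadcast, for illustration
src = bytes.fromhex("005056000001")   # VMware OUI-style MAC
header = (dst + src
          + dot1q_tag(0x88A8, 0, 100)     # outer 802.1ad service tag
          + dot1q_tag(0x8100, 0, 200)     # inner 802.1q customer tag
          + struct.pack("!H", 0x0800))    # EtherType: IPv4

print(header.hex())
print(len(header))  # 12 MAC bytes + two 4-byte tags + 2-byte EtherType = 22
```

Inside a Q-in-VNI segment, the whole frame above (tags included) is simply opaque payload to the GENEVE transport - which is exactly why the inner VNF keeps full autonomy over its tags.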

NSX Data Center is not just for virtualizing the data center anymore. This capability, combined with the other precursors (generating network configurations with a CI tool, automatically deploying changes, virtualization), is the future of reliable enterprise networking.

Friday, April 29, 2022

Network Experiments with VMware NSX-T and Cisco Modeling Labs

Cisco Modeling Labs (CML) has turned out to be a great tool for deploying virtual network resources, but the "only Cisco VNFs" limitation is a bit much.

Let's use this opportunity to really take advantage of the capabilities that NSX-T has for virtual network labs!


For the purpose of lab construction, I will use an existing Tier-0 router and uplinks to facilitate the "basics", e.g. internet connectivity && remote accessibility:

Constructing the NSX-T "Outside" Environment

NSX-T has a few super-powers when it comes to constructing network topologies. We need one in particular to execute this - Q-in-VNI encapsulation - which is enabled on a per-segment basis. Compared to NSX-V, this is simple and straightforward - a user simply applies the allowed VLAN range to the vn-segment, in this case, eng-lab-vn-cml-trunk: 
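As a rough illustration, the segment body sent to the Policy API might look like the following - the field names (notably vlan_ids) and the placeholder transport zone path are assumptions to verify against your NSX-T version's API reference:

```python
import json

# Hypothetical sketch of a Q-in-VNI segment body for the NSX Policy API.
# Field names assume NSX-T 3.x; "<tz-uuid>" is a deliberate placeholder.
segment_id = "eng-lab-vn-cml-trunk"
payload = {
    "display_name": segment_id,
    "transport_zone_path": (
        "/infra/sites/default/enforcement-points/default"
        "/transport-zones/<tz-uuid>"
    ),
    # Allowing an inner VLAN range is what enables Q-in-VNI on the segment:
    "vlan_ids": ["0-4094"],
}
# Would be sent as: PATCH /policy/api/v1/infra/segments/{segment_id}
print(json.dumps(payload, indent=2))
```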

We'll need to disable some security features that protect the network by creating and applying the following segment profiles. The objective is to limit ARP snooping and force NSX to learn MAC addresses over the segment instead of from the controller.

I can't stress this enough: the Q-in-VNI feature gives us near-unlimited power with other VNFs, circumventing the 10-vNIC limit in vSphere and reducing the number of segments that must be deployed to what would normally be "physical pipes".

Here's what it looks like via the API:

Expose connectivity to CML

Generating the segments in NSX is complete, and now we need to get into the weeds a bit with CML. CML is, in essence, a Linux machine running containers, and has external interface support. Let's add the segments to the appropriate VM:

CML provides a Cockpit UI to make this easy, but the usual suspects exist to create a Linux bridge. We're going to create both segments as bridge adapters:

A word of warning - this is a Linux Software Bridge, and will copy everything flowing through the port. NSX helps by squelching a significant chunk of the "network storms" that result, but I would recommend not putting it on the same NSX manager or hosts as your production environments.

Leveraging CML connectivity in a Lab

The hard part's done! To consume this new network, add a node of type External Connector, and configure it to leverage the bridge we just created:

The next step would be to deploy resources and configure them to connect to this central point - I'd recommend this approach to make nodes reachable via their management interface, i.e. deploy a management VRF and connect all nodes to it for stuff like Ansible testing. CML has an API, so the end goal here would be to spin up a test topology and test code against it, or to build a mini service provider.

Lessons Learned

This is a network lab, and we're taking some of the guardrails off when building something this free-form. NSX does allow an engineer to do so on a per-segment basis, and insulates the "outside" network from any weird stuff you intend to do. The inherent dangers to spamming BPDUs outwards from a host or packet flooding can be contained with the built-in feature "segment security profiles", indicating that VMware had clear intent to support similar use-cases in the future.

NSX also enables a few other functions if in routed mode with a Tier-1. It's trivial to set up NAT policies, redistribute routes as Tier-1 statics, and control export of any internal connectivity you may or may not want touching the "real network".

I do think that this combination is a really impressive one-two punch for supporting enterprise deployments. If you ask anyone but Cisco, a typical enterprise network environment will involve many different network vendors, and having the flexibility to mix and match in a "testing environment" is a luxury that most infrastructure engineers don't have, but should.

CML Scenario

Here's the CML scenario I deployed to test this functionality:

Sunday, April 17, 2022

Vendor interoperability with multiple STP instances

Spanning Tree is the all-important loop prevention method for Layer 2 topologies and source of ire to network engineers worldwide.

Usually IT engineers list the Dunning-Kruger Effect in a negative context, depicting an oblivious junior or an unaware manager, but I like to focus on the opposite end of the curve with meta-cognition. Striving to develop meta-cognition and self-awareness is difficult and competes with the ego, but is an incredibly powerful tool for learning. I cannot stress enough how important getting comfortable with one's limitations is to a career.

Spanning Tree is a key topic that should be revisited frequently, building upon knowledge growth.

Let's examine some methods for plural instantiation with Spanning Tree.

The OSI Model and sublayers

Ethernet by itself is surprisingly limited in scope for the utility we glean from it. It provides:

  • Source/Destination forwarding
  • Variable length payloads
  • Segmentation with 802.1q (VLAN tagging) or 802.1ad (Q-in-Q tagging)

Ethernet also doesn't provide a Time-to-Live field, which is why switching loops are so critically dangerous in production networks.

Ethernet needs a supporting upper layer (the Logical Link Control sublayer) to relay instructions on how to handle packets. Internal to a switch ASIC, the hardware itself needs some form of indicator to select pipelines for forwarding or processing.  

802.2 LLC and Subnetwork Access Protocol (SNAP) are typically used in concert to bridge this gap and allow a forwarding plane to classify frames and send to the appropriate pipeline. Examples are provided with this article, where LLC and SNAP are used to say "this is a control plane packet, don't copy it, process it" or "this packet is for hosts, copy and forward it".

Multiple Ethernet Versions

Over the years, network vendors implemented multiple versions of the lower Ethernet sublayer, and in many cases did not update the control plane code on network equipment. These legacy framings manage to proliferate throughout computer networks, and often resemble ancient PDU types simply stitched together for compatibility. It's not entirely surprising that multiple Ethernet editions exist given the fragmentation throughout the industry.

I'd strongly recommend reading this study by Muhammad Farooq-i-Azam from the COMSATS Institute of Information Technology. The author outlines methods of testing common forms of Ethernet in production formats, and provides a detailed overview of our progress to standardization. Spanning Tree is a major cause for the remaining consolidation work, as it turns out.

Generally, Ethernet II is what you want to see in a production network, and most host frames will follow this standard. Because its Type field replaces 802.3's length field (values of 1536/0x0600 and above are interpreted as EtherTypes), Ethernet II frames can also carry payloads larger than 1500 bytes - a big advantage in data centers (jumbo frames).

The original Ethernet standard, 802.3, did not support EtherTypes or payloads larger than 1500 bytes, and is typically used by legacy code in a switching control plane. Two major variants are used within this protocol, extended by LLC and (optionally) with SNAP as an extension.

So, why does this matter when talking about spanning tree? Network Equipment Providers (NEPs) haven't all updated their Layer 2 control plane protocols in a long time. Bridge Protocol Data Units (BPDUs) are transmitted inconsistently (consistently within a vendor's own hardware, but inconsistently with everyone else's), causing a wide variety of interoperability issues.

Per-VLAN Spanning Tree

From a protocol standpoint, Per-VLAN STP and RSTP are probably the simplest method with the fewest design implications and most intuitive protocol layout - but some dangers are inherent when running multi-vendor networks. 

Let's examine a few captured packets containing the STP control plane

Cisco PVRST+ Dissection

Cisco structures the per-VLAN control plane by wrapping instantiated BPDUs:

  • With an Ethernet II Layer 2 Lower
  • With an 802.1q tag encapsulating the VLAN ID
  • With a SNAP Layer 2 Upper, PID of PVSTP+ (0x010b)
  • Destination of 0100.0ccc.cccd

Arista and Cisco Vendor Interoperability

Interestingly enough, I discovered that Arista appears to implement PVRST compatibly with Cisco (with some adjustments covered in Arista's whitepaper). To validate this, I executed another packet dissection with Arista's vEOS, which is available to the public. I have provided the results here, but the PDUs are nearly identical to the Cisco implementation.

MST Dissection

For the majority of vendor interoperable spanning-tree implementations, this will be a network engineer's best option. MST allows an engineer to specify up to 16 separate instances of Spanning Tree, either 802.1d or 802.1w. The primary hazards with leveraging MST have a great deal to do with trunking edge ports, as each topology must be accounted for and carefully planned. BPDU Guard, Loop Guard, and VLAN pruning are absolutely necessary when planning MST, in addition to diagramming each topology that will be instantiated.

IEEE's MST standard is implemented per-instance, and relayed within one common BPDU. It's pleasingly efficient:

  • With an 802.3 Ethernet Layer 2 Lower
  • With no 802.1q tag 
  • With an LLC header, ID of BPDU (0x42)
  • With a destination MAC 0180.c200.0000
  • After the BPDU typical attributes, MST attaches an extension indicating priorities, but no unique bridge IDs. If a topology differs, it would be separated onto its own control plane frame.


CST Dissection

For comparison, I also tested a Mikrotik RouterBoard, which implements RSTP as a single instance (Common Spanning Tree) and optionally supports Multiple Spanning Tree (MST). I found the following attributes with default settings:

  • Destination of 0180.c200.0000 (STP All-Bridges Destination Address)
  • 802.3 Ethernet Frame
  • Spanning-tree BPDU Type


Spanning Tree is a foundation to all enterprise networks, but there really seems to be some legacy code and general weirdness in place here. The industry, starting with the data center, is moving to a more deterministic control plane to replace it, whether that be EVPN or a controller-based model like NSX. 

Campus enterprise deployments are beginning to do the same as businesses realize that a shared, multi-tenant campus network can increase the value against the cost of the same equipment. With the downsizing of corporate offices, the only things stopping office providers from also providing a consolidated campus network are a general lack of expertise and an industry full of under-developed solutions. As platforms converge, the same pattern will emerge in campus deployments soon.

In terms of design implications, supplanting Layer 2 legacy control planes is no small feat. Even EVPN requires STP at the edge, but containment and exceptions management are both clear design decisions to make when building enterprise networks.

Saturday, March 26, 2022

Cisco Modeling Labs

Ever wonder what it would be like to have a platform dedicated to Continuous Improvement / Testing / Labbing?

Cisco's put a lot of thought into good ways to do that, and Cisco Modeling Labs(CML) is the latest iteration of their solutions to provide this as a service to enterprises and casual users alike.
CML is the next logical iteration of Virtual Internet Routing Labs (VIRL) and is officially backed by Cisco with legal VNF licensing. It's a KVM type-2 hypervisor, and automatically handles VNF wiring with an HTML5/REST interface.

Cisco's mission is to provide a NetDevOps platform to network engineers, upgrading the industry's skillset to provide an entirely new level of reliability to infrastructure. Hardware improves over time, and refresh cycles complete - transferring the "downtime" problem from hardware/transceiver failure to engineer mistakes. NetDevOps is the antidote to this problem: infrastructure engineers should leverage automation to make any operation done on production equipment absolutely safe.

Solution Overview

Cisco Modeling Labs provides you the ability to:

  • Provision up to 20/40 Cisco NOS nodes (personal licensing) as you see fit
  • Execute "What if?" scenarios without having to pre-provision or purchase network hardware, improving service reliability
  • Develop IaC tools, Ansible playbooks and other automation on systems other than the production network
  • Leverage TRex (Cisco's network traffic generator) to make simulations more real
  • Deploy workloads to the CML fabric to take a closer look or add capabilities inside
  • Save and share labbed topologies
  • Do everything via the API. Cisco DevNet even supplies a Python client for CML to make the API adoption easy!

Some important considerations for CML:

  • Set up a new segment for management interfaces, ensuring that external CI Tooling/Ansible/etc can reach it.
  • VNFs are hungry. NX-OSv images are at the high end (8GB of memory each), and IOSv/CSR1000v will monopolize CPU. Make sure that plenty of resources are allocated
  • Leverages Cisco Smart Licensing. CML uses legitimate VNF images with legitimate licensing, but will need internet access 
  • CML does not provide SD-WAN features, Wireless, or Firepower appliance licensing, but does support deploying them

Let's Install CML! Cisco provides an ESXi (vSphere) compatible OVA:

After it's deployed, I assigned 4 vCPUs and 24 GB of memory, and attached the platform ISO. Search for Modeling Labs:

Once mounted, the wizard continues from there. CML will ask you for passwords, IP configurations, the typical accoutrements:

The installer will take about 10 minutes to run, as it copies all of the base images into your virtual machine. Once it boots up, CML has two interfaces:

  • https://{{ ip }}/ : The CML "Lab" Interface
  • https://{{ ip }}:9090/ : The Ubuntu "cockpit" view, manage the appliance and its software updates from here. Cisco's integrated CML into this GUI as well.

CML's primary interface will not allow workloads to be built until a license is installed. Licenses are fetched from the Cisco Learning Network Store under settings:

CML is ready to go! Happy Labbing!

Saturday, March 19, 2022

Deploy Root Certificates to Debian-based Linux systems with Ansible

 There are numerous advantages to deploying an internal root CA to an enterprise:

  • Autonomy: Enterprises can control how their certificates are issued, structured, and revoked independently of a third party.
    • Slow or fast replacement cycles are permissible if you control the infrastructure, letting you customize the CA to the business needs
    • Want to set rules for what asymmetric cryptography to use? Don't like SHA1? You're in control!
  • Cost: Services like Let's Encrypt break this a bit, but require a publicly auditable service. Most paid CAs charge per-certificate, which can really add up
  • Better than self-signed: Training users to ignore certificate errors is extremely poor cyber hygiene, leaving your users vulnerable to all kinds of problems
  • Multi-purpose: Certificates can be used for users, services, email encryption, getting rid of passwords. They're not just to authenticate web servers.

 The only major obstacle to internal CAs happens to be a pretty old one - finding a scalable way to deliver the root Certificate Authority to appropriate "trust stores" (they do exactly what it sounds like they do) on all managed systems. Here are a few "hot-spots" that I've found over the years, ordered from high-value, low effort to low-value, high effort. They're all worthwhile, so please consider it an "order of operations" and not an elimination list:

  • Windows Certificate Store: With Microsoft Windows' SChannel library, just about everything on the system will install a certificate in one move. I'm not a Windows expert, but this delivery is always the most valuable up-front.
  • Linux Trust Store: Linux provides a trust store in different locations depending on distribution base.
  • Firefox: Mozilla's NSS will store independently from Windows or Linux, and will need to be automated independently.
  • Java Trust Stores are also independently held and specific to deployed version. This will require extensive deployment automation (do it on install, and do it once).
  • Python also has a self-deployed trust store when using libraries like requests, although Debian/Ubuntu-packaged versions are tweaked to use the system store. Of the many ways to point it at the system store, the easiest is to set REQUESTS_CA_BUNDLE as an environment variable pointing to your system trust store.
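For the requests case above, a minimal sketch - the bundle path is the Debian/Ubuntu default:

```python
import os

# Point Python HTTP libraries at the system trust store. requests honors
# REQUESTS_CA_BUNDLE (and CURL_CA_BUNDLE) when no verify= argument is given.
SYSTEM_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"  # Debian/Ubuntu path
os.environ["REQUESTS_CA_BUNDLE"] = SYSTEM_BUNDLE

print(os.environ["REQUESTS_CA_BUNDLE"])
```

In practice you'd export this from a shell profile or systemd unit rather than per-script.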

Hopefully it's pretty clear that automation is about to become your new best friend when it comes to internal CA administration. Let's outline how we'd want to tackle the Linux aspects of this problem:

  • Pick up the root certificate, and deliver from the Controller to the managed node
    • Either Git or an Artifacts store would be adequate for publishing a root certificate for delivery. For simplicity's sake, I'll be adding it to the Git repository.
    • Ansible's copy module enables us to easily complete this task, and is idempotent.
  • Install any software packages necessary to import certificates into the trust store
    • Ansible's apt module enables us to easily complete this task, and is idempotent.
  • Install the certificate into a system's trust store
    • Locations differ based on distribution. Some handling to detect operating system and act accordingly will be worthwhile in mixed environment
    • Ansible's shell module can be used, but only as a fallback. It's not idempotent, and can be distribution-specific.
  • Restart any necessary services
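The Debian flow above boils down to three commands that Ansible runs on our behalf. A sketch - `install_root_ca_commands` is an illustrative helper, and the paths are Debian-specific assumptions:

```python
# Commands behind the Debian/Ubuntu trust-store flow described above.
CERT_DIR = "/usr/local/share/ca-certificates"

def install_root_ca_commands(cert_name):
    """Return the command sequence to trust a root CA on Debian-based systems."""
    # update-ca-certificates only picks up *.crt files in CERT_DIR
    if not cert_name.endswith(".crt"):
        cert_name += ".crt"
    return [
        ["apt-get", "install", "-y", "ca-certificates"],   # tooling package
        ["cp", cert_name, f"{CERT_DIR}/{cert_name}"],      # deliver the cert
        ["update-ca-certificates"],                        # rebuild the store
    ]

for cmd in install_root_ca_commands("internal-root"):
    print(" ".join(cmd))
```

The apt and copy modules cover the first two idempotently; only the final rebuild needs a shell/command task (ideally guarded by a changed-when condition).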

Here's where the beauty of idempotency really starts to shine. With Ansible, it's possible to just set a schedule for the playbook to execute in a CI tool like Jenkins. CI tools add some neat features here, like only executing on a source control change, which may not apply when using an artifacts store to deploy the root certificate.

In this example, I will be adding the play to my nightly update playbook to illustrate how easy this is:

After completion, this action can be tested by a wide variety of means - my favorite would be cURLing a web service that leverages the root CA:


Saturday, March 12, 2022

Cloud-Scale Networking: NSX Datacenter Hierarchical Tier-0s, blending telecom with cloud

 VMware's NSX Datacenter product is designed for a bit more than single enterprise virtual networking and security.

When reviewing platform maximums (NSX-T 3.2 ConfigMax), the listed maximum number of Tier-1 routers is 4,000 logical routers. Achieving that number takes a degree of intentional design, however.

When building a multi-tenant cloud network leveraging NSX Data Center, the primary design elements are straightforward: 

  • Shared Services Tenant
    • A multi-tenant data center will always have common services like outbound connectivity, orchestration tooling, object storage, DNS.
    • This tenant is commonly offered as a component to a physical fabric with a dedicated Workload Domain (WLD), but can be fully virtualized and run on the commodity compute
    • Packaging shared services within a WLD will require repetitive instantiation of common services, but makes the service "anycast-like" in that it will be more resilient for relatively little effort
    • Implementation with hierarchical Tier-0s is comically easy, just attach the shared Tier-1 to the Infrastructure Tier-0!
    • When designing a shared services tenant, a "Globally Routable" prefix is highly recommended in IPv4 to ensure that no conflicts occur. With IPv6, all networks should have globally routable allocations
  • Scaling: Tenants, Tenants, and more Tenants!
    • Most fabric routers have a BGP Peer cap of 256 speakers:
      • Divide that number in half for dual-stack: 128 speakers
      • Remove the n spine nodes (n = 4 here): 124 speakers
      • Add 4-way redundancy (2 Edge Transport Nodes):  31 Speakers, or 8-way, 15 Speakers
    • For customer-owned routing, the scalability maximum of 4,000 logical routers is not achievable without good planning
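The peer arithmetic above generalizes into a quick calculator (the numbers match the list: 256-peer cap, dual stack, four spines):

```python
# BGP speaker budget on a fabric router, per the scaling list above.
def speakers_per_fabric(peer_cap=256, dual_stack=True, spines=4, redundancy=4):
    peers = peer_cap // 2 if dual_stack else peer_cap  # v4 + v6 sessions each
    peers -= spines                                    # fabric infrastructure peers
    return peers // redundancy                         # tenants per redundancy group

print(speakers_per_fabric(redundancy=4))  # 31 (2 Edge Transport Nodes, 4-way)
print(speakers_per_fabric(redundancy=8))  # 15 (8-way)
```

This is exactly why VRF-Lite matters: each tenant Tier-0 peering consumed from this budget multiplies into 100 routing tables instead of one.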

Let's take a look at an infrastructure blueprint for scaling out network tenancy:

The more "boring" version of tenancy in this model supports highly scalable networking, where a customer owns the Tier-1 firewall and can self-service with vCloud Director:


VRF-Lite allows an NSX Engineer to support 100 network namespaces per Edge Transport Node cluster. When leveraging this feature, Tier-1 Logical Routers can connect to a non-default network namespace and peer traditionally with infrastructure that is not owned by the Cloud Provider via Layer 2 Interconnection or something more scalable (like MPLS or EVPN).

Drop-shipping customer workloads into MPLS is an incredibly powerful tool for cloud-scale networking, not just for scalability but for ease of management. NSX-T VRFs can peer directly with the PE, simplifying any LDP or SR implementations required to support tenancy at scale.

With this design, we'd simply add a VRF to the "Customer Tier-0" construct, and peer BGP directly with the MPLS Provider Edge (PE). An NSX-T WLD with 3.0 or newer can support 1,600 instances this way, where most Clos fabrics can support ~4,000. The only difference here is that WLDs can be scaled horizontally with little effort, particularly when leveraging VCF.

A tenant VRF or network namespace will still receive infrastructure routes, but BGP engineering, Longest Prefix Match (LPM), AS-Path manipulation all can be used to ensure appropriate pathing for traffic to customer premise, shared infrastructure or other tenants with traditional telecommunications practices. Optionally, the customer's VRF can even override the advertised default route, steering internet-bound traffic to an external appliance.

This reference design solves another substantial maintainability problem with VRF-Lite implementations - VRF Leaking. The vast majority of hardware based routing platforms do not have a good path to traverse traffic between network namespaces, and software based routing platforms struggle with the maintainability issues associated with using the internal memory bus as a "warp dimension".

With overlay networking, this is easily controlled. VRF constructs in NSX-T inherit the parent BGP ASN, which prevents transit by default. The deterministic method to control route propagation between VRFs with eBGP is to replace the tenant's ASN in the route's AS-Path with the provider's own, e.g.:

Original AS-Path    New AS-Path
64906 64905         64905 64905

AS-Path rewrites provide an excellent balance between preventing transit by default and easily, safely, and maintainably providing transitive capabilities.
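The rewrite shown in the table is easy to express in code - a minimal sketch of the policy's effect, not of NSX's implementation:

```python
# AS-Path rewrite on eBGP advertisement: replace the tenant's ASN with
# the provider Tier-0's own ASN, defeating loop prevention deliberately
# and deterministically for sanctioned inter-VRF routes.
def rewrite_as_path(as_path, tenant_asn, own_asn):
    return [own_asn if asn == tenant_asn else asn for asn in as_path]

print(rewrite_as_path([64906, 64905], tenant_asn=64906, own_asn=64905))
# [64905, 64905]
```

Because the receiving VRF now sees only the provider ASN in the path, standard AS-Path loop prevention no longer discards the route - which is the entire trick.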

The more canonical, CCIE-worthy approach to solving this problem is not yet viable at scale. Inter-Tier-0 meshing of iBGP peers is the only feature available; confederations and route reflectors are not yet exposed to NSX-T from FRRouting. When this capability is included in NSX Data Center, iBGP will be the way to go.

Let's build the topology:

NSX Data Center's super-power is creating segments cheaply and easily, so a mesh of this fashion can be executed two ways.

Creating many /31s, one for each router pair, as overlay segments. NSX-T does this automatically for Tier-0 clustering and Tier-1 inter-links, and GENEVE IDs work to our advantage here. Meshing in this manner is the best approach overall, and should be done with automation in a production scenario. I will probably write a mesh generator in a future post.
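A minimal version of that mesh generator might look like this - the router names and supernet are illustrative:

```python
import ipaddress
from itertools import combinations

# Carve one /31 point-to-point link per router pair out of a supernet.
def p2p_mesh(routers, supernet="10.255.0.0/24"):
    subnets = ipaddress.ip_network(supernet).subnets(new_prefix=31)
    return {pair: next(subnets) for pair in combinations(routers, 2)}

mesh = p2p_mesh(["t0r00", "t0r01", "t0r10", "t0r11"])
for (a, b), net in mesh.items():
    print(f"{a}<->{b}: {net}")
print(len(mesh))  # 4 routers -> 6 point-to-point /31s
```

Feed each entry into a segment-creation task and two interface tasks and the full mesh falls out mechanically.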

In this case, we'll do it IPv6-style, by creating a single interconnection segment and attaching a /24 and a /64 to it. Tier-0 routers will mesh BGP with each other over this link. I'm using Ansible here to build this segment; not many of the knobs and dials are necessary, so it saves both time and physical space.

    # (module name below is assumed; these tasks use the ansible-for-nsxt collection)
    - name: "eng-lab-vn-segment-ix-"
      nsxt_policy_segment:
        hostname: ""
        username: '{{ lookup("env", "APIUSER") }}'
        password: '{{ lookup("env", "APIPASS") }}'
        validate_certs: False
        state: present
        display_name: "eng-lab-vn-segment-ix-"
        transport_zone_display_name: "POTATOFABRIC-overlay-tz"
        replication_mode: "SOURCE"
        admin_state: UP
    - name: "eng-lab-vn-segment-ix-"
      nsxt_policy_segment:
        hostname: ""
        username: '{{ lookup("env", "APIUSER") }}'
        password: '{{ lookup("env", "APIPASS") }}'
        validate_certs: False
        state: present
        display_name: "eng-lab-vn-segment-ix-"
        transport_zone_display_name: "POTATOFABRIC-overlay-tz"
        replication_mode: "SOURCE"
        admin_state: UP

From here, we configure the following external ports and peers on the infrastructure Tier-0:

  • eng-lab-t0r00-ix-1: (AS64905, no peer)
  • eng-lab-t0r00-ix-vrfs-1: (AS64905, no peer)
  • eng-lab-t0r00-ix-2: (AS64905, no peer)
  • eng-lab-t0r00-ix-vrfs-2: (AS64905, no peer)
  • eng-lab-t0r01-ix-1: (AS64906)
  • eng-lab-t0r10-ix-1: (AS64906)

Let's build the tenant default Tier-0:

Then configure BGP and BGP Peers:

Voila! BGP peerings are up without any VLANs whatsoever! Next, the VRF:

For the sake of brevity, I'm skipping some of the configuration after that. The notable difference is that you cannot change the BGP ASN:

That's about it! Peering between NSX-T Tier-0 routers is a snap.

At this point, the next stage of building a cloud must be automation. Enabling self-service instantiation of Tier-1s and/or VRFs will empower a business to onboard customers quickly, and consistency is key. As always, building the infrastructure is just the beginning of the journey.

Lessons Learned

The Service Provider community has only just scratched the surface of what VMware's NSBU has made possible. NSX Data Center is built from the ground up to provide carrier-grade telecommunications features at scale, and blends the two "SPs" (Internet and Cloud Service Providers) into one software suite. I envision that this new form of company will become some type of "Value-Added Telecom" and take the world by storm in the near future.

Diving deeper into NSX-T's Service Provider features is a rewarding experience. The sky is the limit! I did discover a few neat possibilities with this structure and design pattern that may be interesting (or make/break a deployment!):

  • Any part of this can be replaced with a Virtual Network Function. Customers often have a preferred Network Operating System (NOS), or simply want an NGFW in place. VMware doesn't even try to prevent this practice, enabling it as a VM or a CNF (someday). If a Service Provider has 200 Fortinet customers, 1,000 Palo Alto customers, and 400 Check Point customers, all of them will be happy to know they can simply drop whatever they want, wherever they want.
  • Orchestration and automation tools can build fully functional simulated networks for a quick "what if?" lab as a vApp.

Credit where credit is due

The team at 27 Virtual provided the design scenario and the community opportunity to fully realize this idea, and were extremely tolerant of me taking an exercise completely off the rails. You can see my team's development work here:

Sunday, March 6, 2022

VMware NSX-T and Ansible

What is the point of all this software-defined infrastructure if you don't use it?

As prior examples showed, deploying NSX Data Center offers a fairly straightforward path to SDN, allowing a VI admin or network engineer to deploy virtual network resources via a GUI.

This isn't the end of an effort, but the start of a journey. Once the API is available, deploying services on top of a virtual cloud network becomes easier.


First, we need our CI tooling to be capable of leveraging VMware's NSX-T Community module:
ansible-galaxy collection install git+
Note: This requires Ansible 3.0 or higher to leverage the "Galaxy install from git+https" feature; this software package is not hosted on Ansible Galaxy.

Building the playbook

To an Ansible engineer, this part might be a bit bothersome. Since we're interacting with an appliance, there are several differences compared to canonical Ansible playbooks:
  • Inventory: Playbooks for network appliances specify targets inside their modules rather than using Ansible's inventory. If an inventory is used, the playbook will execute once per inventory host, targeting the same destination device - not good.
  • Credentials are passed to the module being leveraged, not handled by Ansible's connection layer
  • Any parameters describing what the playbook should build are likewise passed to the module

Now that we have the required software, the next step is to map out what to build. Playbooks typically have documentation on what parameters they accept to perform work. Example here.

Let's take what will probably be the most common deployment to automate - network segments:

Executing the Playbook

When writing playbooks that will be frequently re-used, I like to leverage Jinja in the playbook, denoted by {{}} to morph to whatever need I have at the moment. Ansible supports loading variables with the --extra-vars "@{{ filename }}" statement:

ansible-playbook create_segments.yml --extra-vars "@segment_vars.yml"

Credential Management

I've glossed over a particularly important aspect of automation here - what credentials do we use?
Generally, I prefer to perform "hands-free" execution of aspects like this, so the provided playbook is designed to leverage the "Credentials as Environment Variables" feature in Jenkins automatically.

What I think of the Module

ansible-for-nsxt is a community-maintained module, so expectations for the automation features should be set at the price paid for the software. I've tested this platform quite a bit, and the issues I encountered appeared to be gaps in coverage rather than bugs.

Digging deeper into the Ansible modules themselves, the biggest reason for this is NSX-T's declarative API - it makes more sense to not code those things at all and simply leverage the API. This resonates with me quite a bit!

VMware's community has also built testing into the repository, which seems to indicate that testing is automated!

I do have two complaints about this module:

  • With no maintainer publishing to Ansible Galaxy, Ansible 2 users (Red Hat shops) will have a difficult time installing the software from GitHub
  • Not all modules fully support IPv6 yet

Lessons Learned

Testing Idempotency

When looking to leverage automation at work, change safety is often cited as the primary reason not to do it.

It's okay not to trust automation with your production network, that's what testing is for.

When implementing an automation play like this in a production network, we need to evaluate whether or not it's safe to execute.

The first aspect to test when planning to automate a play is idempotency: executing a play should consistently converge on the desired state defined in the playbook without impacting services unnecessarily (e.g. by deleting and re-creating something).

Idempotency is extremely important with infrastructure - we can't afford downtime as easily as other IT professions can. The good news is that it's pretty easy to verify:

ansible-playbook create_segments.yml --extra-vars "@segment_vars.yml"
ansible-playbook create_segments.yml --extra-vars "@segment_vars.yml"

The second executed playbook should return a status of "ok" instead of "changed". 

If this test passes, we'll evaluate the next aspect of idempotency by executing the playbook, then changing the segment's configuration in some way (GUI, API, Ansible playbook), and re-executing to ensure that the delta was detected by Ansible and can be resolved by it. I just create a second segment_vars_deviation.yml file and execute thusly:

ansible-playbook create_segments.yml --extra-vars "@segment_vars.yml"
ansible-playbook create_segments.yml --extra-vars "@segment_vars_deviation.yml"
ansible-playbook create_segments.yml --extra-vars "@segment_vars.yml"

The second and third executions should return "changed", confirming that Ansible both detects and resolves the deviation.
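Both checks can be wired into CI by scraping the PLAY RECAP that ansible-playbook prints. Here's a hypothetical helper - the recap format shown is Ansible's standard output, while the gating policy (fail on any "changed" in a verification run) is up to you:

```python
import re

def parse_recap(output: str) -> dict:
    """Extract per-host counters (ok/changed/failed/...) from a PLAY RECAP block."""
    recap = {}
    for line in output.splitlines():
        match = re.match(r"^(\S+)\s*:\s*((?:\w+=\d+\s*)+)$", line)
        if match:
            host, counters = match.groups()
            recap[host] = {k: int(v) for k, v in re.findall(r"(\w+)=(\d+)", counters)}
    return recap

sample = """
PLAY RECAP *********************************************************************
localhost                  : ok=4    changed=0    unreachable=0    failed=0
"""
counters = parse_recap(sample)["localhost"]
# Second run of an idempotent play: everything "ok", nothing "changed".
assert counters["changed"] == 0 and counters["failed"] == 0
```

A pipeline step can run the playbook twice, feed the second run's output through this, and fail the build if changed is nonzero.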

More Testing

These two tests have extremely good coverage, indicating a high level of change safety for almost no effort. Additional tests for a play like this can be brainstormed or mind-mapped, then coded from there. Here are some examples:

  • Check Tier-0 routes to ensure that the created prefix is populated
  • Check Looking Glass to ensure the prefix is reachable everywhere
  • Check vCenter for the port-group created

Sunday, January 16, 2022

Bogons, and how to leverage public IP feeds with NSX-T

Have you ever wondered what happened to all the privately-addressed traffic coming from any home network?

Well, if it isn't explicitly blocked by the business, it's routed, and this is not good. Imagine what data leakage can occur when a user mistypes a destination IP - the traffic goes out to the Service Provider, who will probably drop it somewhere, but it invites wiretapping or hijacking along the way.

RFC 1918 over the internet is part of a larger family of addresses called "bogons", an industry term to indicate a short list of prefixes that shouldn't be publicly routed.

Many network attacks traversing the public internet come from what the industry calls "fullbogons": addresses that, while publicly routable, aren't allocated to any legitimate party. These addresses are obviously blockable, with no legitimate uses.

As it turns out, the industry calls this type of traffic Internet background noise, and recent IPv4 shortages have pushed some providers (Cloudflare in particular) into deploying services on former fullbogon space and shouldering the noise from an internet's worth of misconfigured network devices.

The solution for mitigating both problems is the same: filtering that network traffic. Team Cymru provides public services that list all bogon types for public ingestion, all that needs to be done here is implementation and automation.

Bogon strategies

Given that the bogon list is extremely short, bogons SHOULD be implemented as null routes on perimeter routing. Due care may be required when filtering RFC 1918 in enterprise deployments with this method - Longest Prefix Match (LPM) will ensure that any specifically routed prefix will stay reachable, as long as dynamic routing is present and not automatically summarizing to the RFC 1918 parent. If this is a concern, implement what's possible today and build a plan for what isn't later.
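The LPM behavior called out above is easy to demonstrate with the standard library. This quick sketch (prefixes are illustrative) shows that a specifically routed internal prefix always beats the RFC 1918 blackhole:

```python
import ipaddress

def best_match(destination: str, routes: list) -> str:
    """Longest Prefix Match: the most specific route containing the destination wins."""
    dest = ipaddress.ip_address(destination)
    candidates = [ipaddress.ip_network(r) for r in routes]
    return str(max((n for n in candidates if dest in n), key=lambda n: n.prefixlen))

# A 10.0.0.0/8 blackhole coexisting with a dynamically learned internal /16:
rib = ["10.0.0.0/8", "10.50.0.0/16"]

assert best_match("10.50.1.1", rib) == "10.50.0.0/16"  # internal traffic still routed
assert best_match("10.99.1.1", rib) == "10.0.0.0/8"    # everything else is blackholed
```

This is exactly why the blackhole approach is safe as long as the specific internal routes stay in the table and nothing summarizes them away.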

Here's an example of how to implement with VyOS - the prefixes below are the standard IPv4 bogon list:

protocols {
    static {
        route 0.0.0.0/8 {
            blackhole {
            }
        }
        route 10.0.0.0/8 {
            blackhole {
            }
        }
        route 127.0.0.0/8 {
            blackhole {
            }
        }
        route 169.254.0.0/16 {
            blackhole {
            }
        }
        route 172.16.0.0/12 {
            blackhole {
            }
        }
        route 192.0.2.0/24 {
            blackhole {
            }
        }
        route 192.168.0.0/16 {
            blackhole {
            }
        }
        route 198.18.0.0/15 {
            blackhole {
            }
        }
        route 198.51.100.0/24 {
            blackhole {
            }
        }
        route 203.0.113.0/24 {
            blackhole {
            }
        }
        route 224.0.0.0/3 {
            blackhole {
            }
        }
    }
}
This approach is extremely straightforward and provides almost instant value.

Fullbogon strategies

For smaller enterprises and below (in this case, "smaller enterprise" means unable to support 250k+ prefixes via BGP - so nearly everybody), the most effective path to mitigate fullbogons isn't routing. Modern policy-based firewalls typically have features that can subscribe to a list and perform policy-level packet filtering. The following are examples of firewall platform built-ins that let you simply subscribe to a service:

In all of these cases, policies must be configured to enforce on traffic in addition to ingesting the threat feeds.

We can build this on our own, though. Since NSX-T has a policy API, let's apply it to a manager:

The method I provided here can be applied to any IP list with some minimal customization. There is really only one key drawback to this population method - the 4,000 object limit.
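One way around that limit is to shard the feed across several groups. Here's a hedged sketch of the approach - the PATCH path and IPAddressExpression payload shape follow the NSX-T Policy API, while the manager hostname, group naming, and credential handling are placeholders:

```python
import json
import urllib.request

NSX = "https://nsx.lab.example"  # hypothetical manager hostname
DOMAIN = "default"
LIMIT = 4000  # the per-group object ceiling noted above

def chunk(prefixes, size=LIMIT):
    """Split a feed into group-sized slices."""
    return [prefixes[i:i + size] for i in range(0, len(prefixes), size)]

def group_payload(name, prefixes):
    """Policy API body: a group whose membership is a static IP list."""
    return {
        "display_name": name,
        "expression": [{
            "resource_type": "IPAddressExpression",
            "ip_addresses": prefixes,
        }],
    }

def push_groups(prefixes, auth_header):
    """PATCH one group per chunk: fullbogons-0, fullbogons-1, ..."""
    for i, slice_ in enumerate(chunk(prefixes)):
        name = f"fullbogons-{i}"
        req = urllib.request.Request(
            f"{NSX}/policy/api/v1/infra/domains/{DOMAIN}/groups/{name}",
            data=json.dumps(group_payload(name, slice_)).encode(),
            headers={"Authorization": auth_header, "Content-Type": "application/json"},
            method="PATCH",
        )
        urllib.request.urlopen(req)

# A 10,000-prefix feed lands in three groups: 4000 + 4000 + 2000.
sample = [f"192.0.2.{i % 256}/32" for i in range(10_000)]
assert [len(c) for c in chunk(sample)] == [4000, 4000, 2000]
```

Distributed firewall rules would then reference each fullbogons-* group as a source or destination, and the script re-runs on whatever schedule the feed updates.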

GitOps with NSX Advanced Load Balancer and Jenkins


GitOps, a term coined in 2017, describes the practice of performing infrastructure operations from a Git repository. In this practice, we easily develop the ability to re-deploy any broken infrastructure (like application managers), but that doesn't really help infrastructure engineers.

From the perspective of an Infrastructure Engineer, Git has a great deal to offer us:

  • Versioning: Particularly with the load balancer example, many NEPs (Network Equipment Providers) expose object-oriented profiles, allowing services consuming the network to leverage versioned profiles by simply applying them to the service.
  • Release Management: Most enterprises don't have non-production networks to test code, but having a release management strategy is a must for any infrastructure engineer. At a minimum, Git provides the following major tools for helping an infrastructure engineer ensure quality when releasing changes:
    • Collaboration/Testing: Git's Branch/Checkout features contribute a great deal to allowing teams to plan changes on their own infrastructure. If virtual (simulated production) instances of infrastructure are available, this becomes an incredibly powerful tool
    • Versioning: Git's Tags feature provides an engineer the capability of declaring "safe points" and clear roll-backs sets in the case of disaster.
    • Peer Review: Git's Pull Request feature is about as good as it gets in terms of peer review tooling. When releasing from the "planning" branch to a "production" branch, just create a Pull Request providing notification that you want the team to take a look at what you intend to do. Bonus points for performing automated testing to help the team more effectively review the code.

Applying GitOps

On Tooling

Before diving into this particular implementation, a note: none of these tools is special or irreplaceable. The practice matters more than the tooling, and there are many equivalents here:

  • Jenkins: Harness, Travis CI, etc
  • GitHub: GitLab, Atlassian, Gitea, etc.
  • Python: Ansible, Terraform, Ruby, etc.

GitOps is pretty easy to implement (mechanically speaking). Any code designed to deploy infrastructure will execute smoothly from source control when the CI tooling is completely set up. All of the examples provided in this article are simple and portable to other platforms.

On Practice

This is just an example to show how the work can be executed. The majority of the work in implementing GitOps lies with developing release strategy, testing, and peer review processes. The objective is to improve reliability, not to recover an application if it's destroyed.

It does help deploy consistently to new facilities, though.

Let's go!

Since we've already developed the code in a previous post, most of the work is already completed - the remaining portion is simply configuring a CI tool to execute and report.

A brief review of the code shows it was designed to be run headless and create application profiles. Here are some key features for pipeline-executed code to keep in mind:

  • If there's a big enough problem, crash the application so there's an obvious failure. Choosing to crash may feel overly dramatic in other applications, but anything deeper than pass/fail takes more comprehensive reporting. Attempt to identify "minor" versus "major" failures when deciding to crash the build. It's OK to consider everything "major".
  • Plan the code to leverage environment variables where possible, as opposed to arguments
  • Generate some form of "what was performed" report in the code. CI tools can email or webhook notify, and it's good to get a notification of a change and what happened (as opposed to digging into the audit logs on many systems!)
  • Get a test environment. In terms of branching strategy, there will be a lot of build failures and you don't want that to affect production infrastructure.
  • Leverage publicly available code where possible! Ansible (when fully idempotent) fits right into this strategy, just drop the playbooks into your Git repository and pipeline.
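Most of the checklist above condenses into a small skeleton. Here's a hypothetical example - the APIUSER/APIPASS variable names mirror the earlier playbooks, everything else is illustrative - that reads credentials from the environment, crashes loudly on a fatal problem, and emits a "what was performed" report:

```python
import json
import os
import sys

def run_changes():
    """Placeholder for the real work; return one record per change performed."""
    return [{"object": "clienttls-profile", "action": "updated"}]

def main() -> int:
    # Credentials come from the CI tool's environment, never from CLI arguments.
    user = os.environ.get("APIUSER")
    password = os.environ.get("APIPASS")
    if not user or not password:
        # Big enough problem: crash the build with an obvious failure.
        print("FATAL: APIUSER/APIPASS not set", file=sys.stderr)
        return 1
    changes = run_changes()
    # Change report for the pipeline's notification/webhook step.
    print(json.dumps({"changed": len(changes), "details": changes}, indent=2))
    return 0

# Simulate a pipeline run with credentials injected by the CI tool:
os.environ.setdefault("APIUSER", "jenkins")
os.environ.setdefault("APIPASS", "example-secret")
assert main() == 0
```

The nonzero return code is what turns a quiet misconfiguration into a visibly red build.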

Pipeline Execution

Here's the plan. It's an (oversimplified) example of a CI/CD pipeline - we don't really need many of the features a CI tool offers here:

  • Pull code from a Git Repository + branch
    • Jenkins can support a schedule as well, but with GitOps you typically just have the pipeline poll SCM and watch for changes.
  • Clear workspace of all existing data to ensure we don't end up with any unexpected artifacts
  • Load Secrets (username/password)
  • "Build". This stage, since we don't really have to compile, simply lets us execute shell commands.
  • "Post-build actions". Reporting on changed results is valuable and important, but the code will also have to be tuned to provide a coherent change report that turns to code. Numerous static analysis tools can also be run and reported on from here.

The configuration is not particularly complex because the code is designed for it:


This will perform unit testing first, then execute and provide a report on what changed.

Building from Code

The next phase of GitOps would be branch management. Since the production or main branch now represents production, it's not particularly wise to simply commit to it when we want to create a new feature or capability. We're going to prototype next:

  • Outline what change we want to make with a problem statement

  • Identify the changes desired, and build a prototype. Avi is particularly good at this, because each object created can be exported as JSON once we're happy with it.
    • This can be done either as-code, or by the GUI with an export. Whichever works best.
  • Determine any versioning desired. Since we're making a functional but non-breaking change, SemVer calls for incrementing the minor version rather than the patch number. We'll target version v0.1.0 for this release.
  • Create a new branch, and label in a way that's useful, e.g. clienttls-v0.1.0-release
  • Generate the code required. Note: If you use the REST client, this is particularly easy to export:
    • python3 -m restify -f nsx-alb/settings.json get_tls_profile --vars '{"id": "clienttls-prototype-v0.1.0"}'
  • Place this as a JSON file in the desired profile folder. 
  • Add the new branch to whatever testing loop (preferably the non-prod instance!) is currently used to ensure that the build doesn't break anything.
  • After a clean result from the pipeline, create a pull request. Notice how easy it is to establish peer reviews with this method!

After the application, we'll see the generated profiles here:

What's the difference?

When discussing this approach with other infrastructure engineers, the answer is "not much". GitOps is not useful without good practice. GitOps, put simply, makes disciplined process easier:

  • Peer Review: Instead of meetings, advance reading, some kind of Microsoft Office document versioning and comments, a git pull request is fundamentally better in every way, and easier too. GitHub even has a mobile app to make peer review as frictionless as possible
  • Testing: Testing is normally a manual process in infrastructure if performed at all. Git tools like GitHub and Bitbucket support in-line reporting, meaning that tests not only cost zero effort, but the results are automatically added to your pull requests!
  • Sleep all night: It's really easy to set up a 24-hour pipeline release schedule, so that roll to production could happen at 3 AM with no engineers awake unless there's a problem

To summarize, I just provided a tool-oriented example, but the discipline and process is what matters. The same process would apply to:

  • Bamboo and Ansible
  • Harness and Nornir
The only thing missing is more systems with declarative APIs.
