Saturday, March 26, 2022

Cisco Modeling Labs

Ever wonder what it would be like to have a platform dedicated to Continuous Improvement / Testing / Labbing?

Cisco's put a lot of thought into good ways to do that, and Cisco Modeling Labs (CML) is the latest iteration of their solutions to provide this as a service to enterprises and casual users alike.
CML is the next logical iteration of Virtual Internet Routing Lab (VIRL) and is officially backed by Cisco with legal VNF licensing. It runs its nodes on a KVM hypervisor and automatically handles VNF wiring through an HTML5/REST interface.

Cisco's mission is to provide a NetDevOps platform to network engineers, upgrading the industry's skill set to bring an entirely new level of reliability to infrastructure. Hardware improves over time and refresh cycles complete, transferring the "downtime" problem from hardware and transceiver failures to engineer mistakes. NetDevOps is the antidote to this problem: infrastructure engineers should leverage automation to make any operation performed on production equipment absolutely safe.

Solution Overview

Cisco Modeling Labs provides you the ability to:

  • Provision up to 20 or 40 Cisco NOS nodes (depending on the personal license tier) as you see fit
  • Execute "What if?" scenarios without having to pre-provision or purchase network hardware, improving service reliability
  • Develop IaC tools, Ansible playbooks and other automation on systems other than the production network
  • Leverage TRex (Cisco's network traffic generator) to make simulations more real
  • Deploy workloads to the CML fabric to take a closer look or add capabilities inside
  • Save and share labbed topologies
  • Do everything via the API. Cisco DevNet even supplies a Python client for CML to make the API adoption easy!
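
Everything in that list can be driven without the GUI at all. As a rough sketch of hitting the API directly from Ansible rather than the Python client (cml_host, the environment variable names, and the /api/v0/authenticate path are assumptions to verify against your CML release's API documentation):

- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Fetch an API token from CML (sketch)
      ansible.builtin.uri:
        url: "https://{{ cml_host }}/api/v0/authenticate"
        method: POST
        validate_certs: false
        body_format: json
        body:
          username: '{{ lookup("env", "CML_USER") }}'
          password: '{{ lookup("env", "CML_PASS") }}'
        return_content: true
      register: cml_token  # the returned token is then passed as a Bearer header to later calls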

Some important considerations for CML:

  • Set up a new segment for management interfaces, ensuring that external CI tooling, Ansible, etc. can reach it.
  • VNFs are hungry. NX-OSv images are at the high end (8 GB of memory each), and IOSv/CSR1000v will monopolize CPU. Make sure plenty of resources are allocated.
  • CML leverages Cisco Smart Licensing. It uses legitimate VNF images with legitimate licensing, but it will need internet access.
  • CML does not provide SD-WAN, Wireless, or Firepower appliance licensing, but it does support deploying those images.

Let's Install CML! Cisco provides an ESXi (vSphere) compatible OVA:

After it's deployed, I assigned 4 vCPUs and 24 GB of memory and attached the platform ISO. Search under software.cisco.com for Modeling Labs:

Once mounted, the wizard continues from there. CML will ask you for passwords, IP configurations, the typical accoutrements:

The installer will take about 10 minutes to run, as it copies all of the base images into your virtual machine. Once it boots up, CML has two interfaces:

  • https://{{ ip }}/ : The CML "Lab" Interface
  • https://{{ ip }}:9090/ : The Ubuntu "Cockpit" view; manage the appliance and its software updates from here. Cisco has integrated CML into this GUI as well.

CML's primary interface will not allow workloads to be built until a license is installed. Licenses are fetched from the Cisco Learning Network Store and applied under CML's settings:

CML is ready to go! Happy Labbing!

Saturday, March 19, 2022

Deploy Root Certificates to Debian-based Linux systems with Ansible

 There are numerous advantages to deploying an internal root CA to an enterprise:

  • Autonomy: Enterprises can control how their certificates are issued, structured, and revoked independently of a third party.
    • Slow or fast replacement cycles are permissible if you control the infrastructure, letting you customize the CA to the business needs
    • Want to set rules for what asymmetric cryptography to use? Don't like SHA1? You're in control!
  • Cost: Services like Let's Encrypt change this equation a bit, but they require a publicly auditable service. Most paid CAs charge per certificate, which can really add up
  • Better than self-signed: Training users to ignore certificate errors is extremely poor cyber hygiene, leaving your users vulnerable to all kinds of problems
  • Multi-purpose: Certificates can be used for users, services, email encryption, and even getting rid of passwords. They're not just for authenticating web servers.

 The only major obstacle to internal CAs happens to be a pretty old one: finding a scalable way to deliver the root Certificate Authority to the appropriate "trust stores" (they do exactly what it sounds like they do) on all managed systems. Here are a few "hot-spots" that I've found over the years, ordered from high-value, low-effort to low-value, high-effort. They're all worthwhile, so please consider this an "order of operations" and not an elimination list:

  • Windows Certificate Store: Because just about everything on the system uses Microsoft Windows' SChannel library, installing a certificate here covers the whole system in one move. I'm not a Windows expert, but this delivery is always the most valuable up front.
  • Linux Trust Store: Linux provides a trust store in different locations depending on distribution base.
  • Firefox: Mozilla's NSS will store independently from Windows or Linux, and will need to be automated independently.
  • Java Trust Stores are also independently held and specific to the deployed version. This will require extensive deployment automation (do it on install, and do it once).
  • Python also ships its own trust store when using libraries like requests, though the Debian/Ubuntu packages are patched to use the system store. There are several ways to force the system store, but the easiest is to point the REQUESTS_CA_BUNDLE environment variable at it, as shown in the sketch below.
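
For that last item, a minimal Ansible sketch (the /etc/environment approach, the debian_servers group name, and the Debian/Ubuntu bundle path are assumptions that fit most stock installs):

- hosts: debian_servers
  become: true
  tasks:
    - name: Point the requests library at the system trust store (Debian/Ubuntu path)
      ansible.builtin.lineinfile:
        path: /etc/environment
        line: REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt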

Hopefully it's pretty clear that automation is about to become your new best friend when it comes to internal CA administration. Let's outline how we'd want to tackle the Linux aspects of this problem:

  • Pick up the root certificate, and deliver from the Controller to the managed node
    • Either Git or an Artifacts store would be adequate for publishing a root certificate for delivery. For simplicity's sake, I'll be adding it to the Git repository.
    • Ansible's copy module enables us to easily complete this task, and is idempotent.
  • Install any software packages necessary to import certificates into the trust store
    • Ansible's apt module enables us to easily complete this task, and is idempotent.
  • Install the certificate into a system's trust store
    • Locations differ based on distribution. Some handling to detect the operating system and act accordingly will be worthwhile in mixed environments
    • Ansible's shell module can be used, but only as a fallback. It's not idempotent, and can be distribution-specific.
  • Restart any necessary services

Here's where the beauty of idempotency really starts to shine. With Ansible, it's possible to just set a schedule for the playbook to execute in a CI tool like Jenkins. CI tools add some neat features here, like only executing on a source control change, which may not apply when using an artifacts store to deploy the root certificate.

In this example, I will be adding the play to my nightly update playbook to illustrate how easy this is:
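
A minimal sketch of what that play can look like, assuming the root certificate sits at files/root-ca.crt in the repository and the targets live in a debian_servers inventory group (both names are placeholders):

- hosts: debian_servers
  become: true
  tasks:
    - name: Ensure the trust store tooling is present
      ansible.builtin.apt:
        name: ca-certificates
        state: present
    - name: Deliver the root CA from the controller to the managed node
      ansible.builtin.copy:
        src: files/root-ca.crt
        dest: /usr/local/share/ca-certificates/root-ca.crt
        owner: root
        group: root
        mode: "0644"
      notify: Update trust store
  handlers:
    - name: Update trust store
      ansible.builtin.command: update-ca-certificates

The handler keeps the non-idempotent shell step contained: update-ca-certificates only runs when the copy task actually changes something.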

After completion, this action can be tested by a wide variety of means. My favorite is to cURL a web service whose certificate chains to the new root CA:

curl https://nsx.engyak.co/
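
If you'd rather keep the verification in the playbook itself, a hedged sketch of the same check (the group name is a placeholder, and the task should fail until the root CA lands in the managed node's trust store):

- hosts: debian_servers
  tasks:
    - name: Verify the service certificate validates against the system trust store
      ansible.builtin.uri:
        url: https://nsx.engyak.co/
        validate_certs: true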

Saturday, March 12, 2022

Cloud-Scale Networking: NSX Datacenter Hierarchical Tier-0s, blending telecom with cloud

 VMware's NSX Datacenter product is designed for a bit more than single enterprise virtual networking and security.

When reviewing platform maximums (NSX-T 3.2 ConfigMax), the listed maximum number of Tier-1 routers is 4,000 logical routers. Achieving that number takes a degree of intentional design, however.

When building a multi-tenant cloud network leveraging NSX Data Center, the primary design elements are straightforward: 

  • Shared Services Tenant
    • A multi-tenant data center will always have common services like outbound connectivity, orchestration tooling, object storage, and DNS.
    • This tenant is commonly offered as a component of a physical fabric with a dedicated Workload Domain (WLD), but it can be fully virtualized and run on commodity compute
    • Packaging shared services within a WLD will require repetitive instantiation of common services, but it makes the service "anycast-like" in that it will be more resilient for relatively little effort
    • Implementation with hierarchical Tier-0s is comically easy: just attach the shared Tier-1 to the Infrastructure Tier-0!
    • When designing a shared services tenant, a "Globally Routable" prefix is highly recommended in IPv4 to ensure that no conflicts occur. With IPv6, all networks should have globally routable allocations
  • Scaling: Tenants, Tenants, and more Tenants!
    • Most fabric routers have a BGP peer cap of 256 speakers:
      • Divide that number in half for dual-stack: 128 speakers
      • Remove n speakers for the spine nodes (4 here): 124 speakers
      • Add 4-way redundancy (2 Edge Transport Nodes): 31 speakers, or 8-way: 15 speakers
    • For customer-owned routing, the scalability maximum of 4,000 logical routers is not achievable without good planning

Let's take a look at an infrastructure blueprint for scaling out network tenancy:

The more "boring" version of tenancy in this model supports highly scalable networking, where a customer owns the Tier-1 firewall and can self-service with vCloud Director:

VRF-Lite allows an NSX Engineer to support 100 network namespaces per Edge Transport Node cluster. When leveraging this feature, Tier-1 Logical Routers can connect to a non-default network namespace and peer traditionally with infrastructure that is not owned by the Cloud Provider via Layer 2 Interconnection or something more scalable (like MPLS or EVPN).

Empowering Cloud-Scale Networking with a feature like drop-shipping customer workloads into MPLS is an incredibly powerful tool, not just for scalability, but for ease of management. NSX-T VRFs can peer directly with the PE, simplifying any LDP or SR implementations required to support tenancy at scale.

With this design, we'd simply add a VRF to the "Customer Tier-0" construct, and peer BGP directly with the MPLS Provider Edge (PE). An NSX-T WLD with 3.0 or newer can support 1,600 instances this way, where most Clos fabrics can support ~4,000. The only difference here is that WLDs can be scaled horizontally with little effort, particularly when leveraging VCF.

A tenant VRF or network namespace will still receive infrastructure routes, but BGP engineering, Longest Prefix Match (LPM), and AS-Path manipulation can all be used to ensure appropriate pathing for traffic to customer premises, shared infrastructure, or other tenants using traditional telecommunications practices. Optionally, the customer's VRF can even override the advertised default route, steering internet-bound traffic to an external appliance.

This reference design solves another substantial maintainability problem with VRF-Lite implementations: VRF leaking. The vast majority of hardware-based routing platforms do not have a good path to traverse traffic between network namespaces, and software-based routing platforms struggle with the maintainability issues associated with using the internal memory bus as a "warp dimension".

With overlay networking, this is easily controlled. VRF constructs in NSX-T inherit the parent BGP ASN, which prevents transit by default. The deterministic method to control route propagation between VRFs with eBGP is to replace the tenant's ASN in the route's AS-Path with the infrastructure Tier-0's own, e.g.:

  • Original AS-Path: 64906 64905
  • New AS-Path: 64905 64905

AS-Path rewrites provide an excellent balance between preventing transit by default and easily, safely, and maintainably providing transitive capabilities.

The more canonical, CCIE-worthy approach to solving this problem is not yet viable at scale. Inter-Tier-0 meshing of iBGP peers is the only feature available; confederations and route reflectors are not yet exposed to NSX-T from FRRouting. When this capability is included in NSX Data Center, iBGP will be the way to go.

Let's build the topology:

NSX Data Center's super-power is creating segments cheaply and easily, so a mesh of this fashion can be executed in two ways.

The first is creating many /31s as overlay segments, one for each pair of router members. NSX-T does this automatically for Tier-0 clustering and Tier-1 inter-links, and GENEVE IDs work to our advantage here. Meshing in this manner is the best approach overall and should be done with automation in a production scenario. I will probably write a mesh generator in a future post.

In this case, we'll do it IPv6-style, creating shared interconnection segments and attaching a /24 (the IPv6 equivalent would be a /64) to each. Tier-0 routers will mesh BGP with each other over these links. I'm using Ansible here to build the segments; not many of the knobs and dials are necessary, so it saves both time and space.

- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: "eng-lab-vn-segment-ix-10.7.200.0_24:"
      vmware.ansible_for_nsxt.nsxt_policy_segment:
        hostname: "nsx.lab.engyak.net"
        username: '{{ lookup("env", "APIUSER") }}'
        password: '{{ lookup("env", "APIPASS") }}'
        validate_certs: False
        state: present
        display_name: "eng-lab-vn-segment-ix-10.7.200.0_24"
        transport_zone_display_name: "POTATOFABRIC-overlay-tz"
        replication_mode: "SOURCE"
        admin_state: UP
    - name: "eng-lab-vn-segment-ix-10.7.201.0_24:"
      vmware.ansible_for_nsxt.nsxt_policy_segment:
        hostname: "nsx.lab.engyak.net"
        username: '{{ lookup("env", "APIUSER") }}'
        password: '{{ lookup("env", "APIPASS") }}'
        validate_certs: False
        state: present
        display_name: "eng-lab-vn-segment-ix-10.7.201.0_24"
        transport_zone_display_name: "POTATOFABRIC-overlay-tz"
        replication_mode: "SOURCE"
        admin_state: UP

From here, we configure the following external ports and peers on the infrastructure Tier-0:

  • eng-lab-t0r00-ix-1: 10.7.200.1/24 (AS64905, no peer)
  • eng-lab-t0r00-ix-vrfs-1: 10.7.201.1/24 (AS64905, no peer)
  • eng-lab-t0r00-ix-2: 10.7.200.2/24 (AS64905, no peer)
  • eng-lab-t0r00-ix-vrfs-2: 10.7.201.2/24 (AS64905, no peer)
  • eng-lab-t0r01-ix-1: 10.7.200.11/24 (AS64906)
  • eng-lab-t0r10-ix-1: 10.7.201.100/24 (AS64906)

Let's build the tenant default Tier-0:
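
If you'd rather script this step as well, a rough sketch with the same collection (the nsxt_policy_tier0 module ships with ansible-for-nsxt, but treat ha_mode and any omitted parameters as assumptions to verify against the module documentation):

- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: "eng-lab-t0r01"
      vmware.ansible_for_nsxt.nsxt_policy_tier0:
        hostname: "nsx.lab.engyak.net"
        username: '{{ lookup("env", "APIUSER") }}'
        password: '{{ lookup("env", "APIPASS") }}'
        validate_certs: False
        state: present
        display_name: "eng-lab-t0r01"
        ha_mode: "ACTIVE_STANDBY"  # assumption: pick the HA model that matches the design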

Then configure BGP and BGP Peers:

Voila! BGP peerings are up without any VLANs whatsoever! Next, the VRF:

For the sake of brevity, I'm skipping some of the configuration after that. The notable difference is that you cannot change the BGP ASN:

That's about it! Peering between NSX-T Tier-0 routers is a snap.

At this point, the next stage in building a cloud must be automation. Enabling self-service instantiation of Tier-1s and/or VRFs will empower a business to onboard customers quickly, so consistency is key. Building the infrastructure is just the beginning of the journey, as always.

Lessons Learned

The Service Provider community has only just scratched the surface of what VMware's NSBU has made possible. NSX Data Center is built from the ground up to provide carrier-grade telecommunications features at scale, and blends the two "SPs" (Internet and Cloud Service Providers) into one software suite. I envision that this new form of company will become some type of "Value-Added Telecom" and take the world by the horns in the near future.

Diving deeper into NSX-T's Service Provider features is a rewarding experience. The sky is the limit! I did discover a few neat possibilities with this structure and design pattern that may be interesting (or make or break a deployment!):

  • Any part of this can be replaced with a Virtual Network Function. Customers may prefer a particular Network Operating System (NOS), or simply want an NGFW in place. VMware doesn't even try to prevent this practice, enabling it as a VM or a CNF (someday). If a Service Provider has 200 Fortinet customers, 1,000 Palo Alto customers, and 400 Checkpoint customers, all of them will be happy to know they can simply drop whatever they want, wherever they want.
  • Orchestration and automation tools can build fully functional simulated networks for a quick "what if?" lab as a vApp.

Credit where credit is due

The team at 27 Virtual provided the design scenario and the community opportunity to fully realize this idea, and were extremely tolerant of me taking an exercise completely off the rails. You can see my team's development work here: https://github.com/ngschmidt/nsx-ninja-design-v3

Sunday, March 6, 2022

VMware NSX-T and Ansible

What is the point of all this software-defined infrastructure if you don't use it?

As shown in prior examples, deploying NSX Data Center provides a fairly straightforward path to SDN, allowing a VI admin or network engineer to deploy virtual network resources via a GUI.

This isn't the end of an effort, but the start of a journey. Once the API is available, deploying services on top of a virtual cloud network becomes much easier.

Setup

First, we need our CI tooling to be capable of leveraging VMware's NSX-T Community module:
ansible-galaxy collection install git+https://github.com/vmware/ansible-for-nsxt
Note: This requires Ansible 3.0 or higher to leverage the "Galaxy install from git+https" feature. This software package is not hosted on Ansible Galaxy.
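
If you prefer to pin the collection in source control, a sketch of the equivalent requirements.yml (the branch to track is an assumption; check the repository):

collections:
  - name: https://github.com/vmware/ansible-for-nsxt
    type: git
    version: master

Install it with ansible-galaxy collection install -r requirements.yml.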
 

Building the playbook


To an Ansible engineer, this part might be a bit bothersome. Since we're interacting with an appliance, there are several differences compared to canonical Ansible playbooks:
  • Inventory: Playbooks for network providers specify targets inside of their modules rather than using Ansible's inventory. If an inventory is used, the playbook will execute once on every inventory host, targeting the same destination device, which is not good.
  • Credentials are passed to the leveraged module rather than handled by Ansible's connection layer
  • Any parameters that describe the thing the playbook should build are also passed to the module

Now that we have the required software, the next step is to map out what to build. Playbooks typically have documentation on what parameters they will accept to perform work. Example here.

Let's take what will probably be the most common deployment to automate: network segments.
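
A minimal sketch of that playbook, saved as create_segments.yml and reusing the parameters from the earlier segment tasks, but with Jinja variables (nsx_manager, segment_name, and transport_zone are placeholders supplied at runtime):

- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: "Create or update an overlay segment"
      vmware.ansible_for_nsxt.nsxt_policy_segment:
        hostname: "{{ nsx_manager }}"
        username: '{{ lookup("env", "APIUSER") }}'
        password: '{{ lookup("env", "APIPASS") }}'
        validate_certs: False
        state: present
        display_name: "{{ segment_name }}"
        transport_zone_display_name: "{{ transport_zone }}"
        replication_mode: "SOURCE"
        admin_state: UP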

Executing the Playbook

When writing playbooks that will be frequently re-used, I like to leverage Jinja in the playbook, denoted by {{ }}, to morph it to whatever I need at the moment. Ansible supports loading variables with the --extra-vars "@{{ filename }}" statement:

ansible-playbook create_segments.yml --extra-vars "@segment_vars.yml"
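
A hedged example of what segment_vars.yml could hold for the sketch above (the values are illustrative; the manager hostname and transport zone match the lab from the earlier post):

nsx_manager: nsx.lab.engyak.net
segment_name: eng-lab-vn-segment-app01
transport_zone: POTATOFABRIC-overlay-tz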

Credential Management

I've glossed over a particularly important aspect of automation here - what credentials do we use?
 
Generally, I prefer to perform "hands-free" execution of aspects like this, so the provided playbook is designed to leverage the "Credentials as Environment Variables" feature in Jenkins automatically.

What I think of the Module

ansible-for-nsxt is a community-maintained collection, so expectations for the automation features should be set at the price paid for the software. I've tested this platform quite a bit, and the issues that I encountered appeared to be code obfuscation.

Digging deeper into the Ansible modules themselves, the biggest reason for this is NSX-T's declarative API: it makes more sense not to code those things at all and simply leverage the API. This resonates with me quite a bit!

VMware's community has also built testing into the repository, which seems to indicate that the testing is automated!

I do have two complaints about this module:

  • No maintainer for Ansible Galaxy means that Ansible 2 users (Red Hat shops) will have a difficult time installing the software from GitHub
  • Not all modules fully support IPv6 yet

Lessons Learned

Testing Idempotency

When looking to leverage automation at work, change safety is often cited as the primary reason not to do it.

It's okay not to trust automation with your production network; that's what testing is for.

When implementing an automation play like this in a production network, we need to evaluate whether or not it's safe to execute.

The first aspect to test when planning to automate a play is idempotency: executing the play should consistently produce the desired state defined in the playbook without impacting services unnecessarily (e.g., by deleting and re-creating something).

Idempotency is extremely important with infrastructure; we can't afford downtime as easily as other IT professions can. The good news is that it's pretty easy to test:

ansible-playbook create_segments.yml --extra-vars "@segment_vars.yml"
ansible-playbook create_segments.yml --extra-vars "@segment_vars.yml"

The second executed playbook should return a status of "ok" instead of "changed". 

If this test passes, we'll evaluate the next aspect of idempotency by executing the playbook, then changing the segment's configuration in some way (GUI, API, Ansible playbook), and re-executing to ensure that the delta was detected by Ansible and can be resolved by it. I just create a second segment_vars_deviation.yml file and execute thusly:

ansible-playbook create_segments.yml --extra-vars "@segment_vars.yml"
ansible-playbook create_segments.yml --extra-vars "@segment_vars_deviation.yml"
ansible-playbook create_segments.yml --extra-vars "@segment_vars.yml"

This test should show Ansible detecting and resolving the deviation: the second and third runs return "changed".

More Testing

These two tests provide extremely good coverage, indicating a high level of change safety for almost no effort. Additional tests to execute for a play like this can be mind-mapped or brainstormed, and then coded from there. Here are some examples:

  • Check Tier-0 routes to ensure that the newly built prefix populated
  • Check Looking Glass to ensure the prefix is reachable everywhere
  • Check vCenter for the port-group created
