Sunday, January 16, 2022

Bogons, and how to leverage public IP feeds with NSX-T

Have you ever wondered what happens to all the privately-addressed traffic coming from any home network?

Well, if the business doesn't explicitly block it, it gets routed - and that's not good. Imagine the data leakage that can occur when a user mistypes a destination IP: the traffic heads out to the Service Provider, who will probably drop it somewhere, but along the way it's an open invitation for wiretapping or hijacking.

RFC 1918 traffic on the public internet is part of a larger family of addresses called "bogons", an industry term for a short list of prefixes that should never be publicly routed.

Many network attacks traversing the public internet come from what the industry calls "fullbogons": addresses that sit in publicly routable space but haven't been allocated or assigned, so traffic sourced from them has no legitimate origin. These addresses are obviously block-able.

As it turns out, the industry calls this type of traffic Internet background noise, and recent IPv4 shortages have pushed some providers (Cloudflare in particular) into deploying services on previously fullbogon space - and shouldering the noise from an internet's worth of misconfigured network devices.

The solution for mitigating both problems is the same: filter that network traffic. Team Cymru provides public feeds listing every bogon type for public ingestion; all that's left here is implementation and automation.

Bogon strategies

Given that the bogon list is extremely short, bogons SHOULD be implemented as null routes on perimeter routing. Due care may be required when filtering RFC 1918 in enterprise deployments with this method - Longest Prefix Match (LPM) will ensure that any specifically routed prefix will stay reachable, as long as dynamic routing is present and not automatically summarizing to the RFC 1918 parent. If this is a concern, implement what's possible today and build a plan for what isn't later.

Here's an example of how to implement with VyOS:


protocols {
    static {
        route 10.0.0.0/8 {
            blackhole {
            }
        }
        route 10.66.0.0/16 {
            blackhole {
            }
        }
        route 100.64.0.0/10 {
            blackhole {
            }
        }
        route 169.254.0.0/16 {
            blackhole {
            }
        }
        route 172.16.0.0/12 {
            blackhole {
            }
        }
        route 192.0.2.0/24 {
            blackhole {
            }
        }
        route 192.88.99.0/24 {
            blackhole {
            }
        }
        route 192.168.0.0/16 {
            blackhole {
            }
        }
        route 198.18.0.0/15 {
            blackhole {
            }
        }
        route 198.51.100.0/24 {
            blackhole {
            }
        }
        route 203.0.113.0/24 {
            blackhole {
            }
        }
    }
}

This approach is extremely straightforward and provides almost instant value.

Fullbogon strategies

For smaller enterprises and below (in this case, "smaller enterprise" means unable to carry 250k+ prefixes via BGP, so nearly everybody), the most effective path to mitigate fullbogons isn't routing. Modern policy-based firewalls typically have features that can subscribe to a list and perform policy-level packet filtering, and several platforms ship built-ins that let you simply subscribe to such a service.

In all of these cases, policies must be configured to enforce on traffic in addition to ingesting the threat feeds.

We can build this on our own, though. Since NSX-T has a policy API, let's apply it to a manager:
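
Here's a minimal sketch of that approach in Python. The feed URL, group name, and credential environment variables are assumptions for a lab; the call itself is a declarative PATCH of a Policy group containing an IPAddressExpression.

# Sketch: populate an NSX-T Policy group from a public bogon feed
import os
import requests

# Team Cymru publishes plain-text bogon feeds - verify the URL before relying on it
FEED_URL = "https://team-cymru.org/Services/Bogons/fullbogons-ipv4.txt"
NSX_MANAGER = os.getenv("NSX_MANAGER", "nsx.lab.example.net")  # placeholder hostname
AUTH = (os.getenv("APIUSER"), os.getenv("APIPASS"))

# Pull the feed and drop comments/blank lines
feed = requests.get(FEED_URL, timeout=30)
feed.raise_for_status()
prefixes = [line.strip() for line in feed.text.splitlines()
            if line.strip() and not line.startswith("#")]

# Trim to the group expression limit discussed below
prefixes = prefixes[:4000]

# Declaratively PATCH a group under the default domain with an IP address expression
group = {
    "display_name": "fullbogons",
    "expression": [{
        "resource_type": "IPAddressExpression",
        "ip_addresses": prefixes,
    }],
}
resp = requests.patch(
    f"https://{NSX_MANAGER}/policy/api/v1/infra/domains/default/groups/fullbogons",
    auth=AUTH,
    json=group,
    verify=False,  # lab only - use a trusted certificate in production
)
resp.raise_for_status()
print(f"Pushed {len(prefixes)} prefixes, HTTP {resp.status_code}")

The resulting group can then be referenced by gateway or distributed firewall deny rules, and the whole script can run from cron or a CI tool so the feed stays current.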

The method I provided here can be applied to any IP list with some minimal customization. There is only really one key drawback to this population method - the 4,000 object limit.

GitOps with NSX Advanced Load Balancer and Jenkins

GitOps

GitOps, a term coined in 2017, describes the practice of performing infrastructure operations from a Git repository. With it, we easily gain the ability to re-deploy any broken infrastructure (like application managers), but that alone doesn't really help infrastructure engineers.

From the perspective of an Infrastructure Engineer, Git has a great deal to offer us:

  • Versioning: Particularly with the load balancer example, many NEPs (Network Equipment Providers) expose object-oriented profiles, allowing services consuming the network to leverage versioned profiles by simply applying them to the service.
  • Release Management: Most enterprises don't have non-production networks to test code, but having a release management strategy is a must for any infrastructure engineer. At a minimum, Git provides the following major tools for helping an infrastructure engineer ensure quality when releasing changes:
    • Collaboration/Testing: Git's Branch/Checkout features contribute a great deal to allowing teams to plan changes on their own infrastructure. If virtual (simulated production) instances of infrastructure are available, this becomes an incredibly powerful tool.
    • Versioning: Git's Tags feature provides an engineer the capability of declaring "safe points" and clear roll-back sets in the case of disaster.
    • Peer Review: Git's Pull Request feature is about as good as it gets in terms of peer review tooling. When releasing from the "planning" branch to a "production" branch, just create a Pull Request providing notification that you want the team to take a look at what you intend to do. Bonus points for performing automated testing to help the team more effectively review the code.

Applying GitOps

On Tooling

Before diving into this particular implementation: none of these tools is unique or irreplaceable. The practice matters more than the tooling, and there are many equivalents here:

  • Jenkins: Harness, Travis CI, etc
  • GitHub: GitLab, Atlassian, Gitea, etc.
  • Python: Ansible, Terraform, Ruby, etc.

GitOps is pretty easy to implement (mechanically speaking). Any code designed to deploy infrastructure will execute smoothly from source control when the CI tooling is completely set up. All of the examples provided in this article are simple and portable to other platforms.

On Practice

This is just an example to show how the work can be executed. The majority of the work in implementing GitOps lies with developing release strategy, testing, and peer review processes. The objective is to improve reliability, not to recover an application if it's destroyed.

It does help deploy consistently to new facilities, though.

Let's go!

Since we've already developed the code in a previous post, most of the work is already completed - the remaining portion is simply configuring a CI tool to execute and report.

A brief review of the code (https://github.com/ngschmidt/python-restify/blob/main/nsx-alb/apply_idempotent_profiles.py) shows it was designed to be run headless and create application profiles. Here are some key features to keep in mind for pipeline-executed code:

  • If there's a big enough problem, crash the application so there's an obvious failure. Choosing to crash may feel overly dramatic in other applications, but anything deeper than pass/fail takes more comprehensive reporting. Attempt to identify "minor" versus "major" failures when deciding to crash the build. It's OK to consider everything "major".
  • Plan the code to leverage environment variables where possible, as opposed to arguments
  • Generate some form of "what was performed" report in the code. CI tools can email or webhook notify, and it's good to get a notification of a change and what happened (as opposed to digging into the audit logs on many systems!) - see the sketch after this list.
  • Get a test environment. In terms of branching strategy, there will be a lot of build failures and you don't want that to affect production infrastructure.
  • Leverage publicly available code where possible! Ansible (when fully idempotent) fits right into this strategy, just drop the playbooks into your Git repository and pipeline.
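
To make those guidelines concrete, here's a minimal sketch of the skeleton described above - every name in it is illustrative rather than lifted from the real pipeline:

# Sketch of pipeline-friendly conventions: env vars in, loud failures out, report at the end
import json
import os
import sys

# Credentials and targets come from the CI tool's environment, not CLI arguments
api_user = os.getenv("APIUSER")
api_pass = os.getenv("APIPASS")
if not api_user or not api_pass:
    # Crash loudly so the build is marked failed - it's OK to treat everything as "major"
    sys.exit("APIUSER/APIPASS were not provided by the pipeline")

changes = []  # accumulate a "what was performed" report as the run progresses

try:
    # ... the actual idempotent work goes here, appending to `changes` ...
    changes.append({"object": "example-profile", "action": "no-op"})
except Exception as err:
    # Any unhandled problem fails the build with an obvious message
    sys.exit(f"Run aborted: {err}")

# Print the change report so the CI tool can email or webhook it
print(json.dumps(changes, indent=4))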

Pipeline Execution

Here's the plan. It's an (oversimplified) example of a CI/CD pipeline - we don't really need many of the features a CI tool provides here:

  • Pull code from a Git Repository + branch
    • Jenkins can support a schedule as well, but with GitOps you typically just have the pipeline poll SCM and watch for changes.
  • Clear workspace of all existing data to ensure we don't end up with any unexpected artifacts
  • Load Secrets (username/password)
  • "Build". This stage, since we don't really have to compile, simply lets us execute shell commands.
  • "Post-build actions". Reporting on changed results is valuable and important, but the code also has to be tuned to emit a coherent change report the pipeline can surface. Numerous static analysis tools can also be run and reported on from here.

The configuration is not particularly complex because the code is designed for it:

 

This will perform unit testing first, then execute and provide a report on what changed.

Building from Code

The next phase of GitOps is branch management. Since the main branch now represents production, it's not particularly wise to simply commit to it when we want to create a new feature or capability. We're going to prototype next:

  • Outline what change we want to make with a problem statement

  • Identify the changes desired, and build a prototype. Avi is particularly good at this, because each object created can be exported as JSON once we're happy with it.
    • This can be done either as-code, or by the GUI with an export. Whichever works best.
  • Determine any versioning desired. Since we're going to make a functional but non-breaking change, SemVer reserves the third number for bug fixes, so we can't just increment it - instead, we'll bump the minor version and target v0.1.0 for this release.
  • Create a new branch, and label in a way that's useful, e.g. clienttls-v0.1.0-release
  • Generate the code required. Note: If you use the REST client, this is particularly easy to export:
    • python3 -m restify -f nsx-alb/settings.json get_tls_profile --vars '{"id": "clienttls-prototype-v0.1.0"}' 
  • Place this as a JSON file in the desired profile folder. 
  • Add the new branch to whatever testing loop (preferably the non-prod instance!) is currently used to ensure that the build doesn't break anything.
  • After a clean result from the pipeline, create a pull request (Example: https://github.com/ngschmidt/python-restify/pull/17). Notice how easy it is to establish peer reviews with this method!

Once applied, we'll see the generated profiles here:

What's the difference?

When discussing this approach with other infrastructure engineers, the answer is "not much". GitOps is not useful without good practice. GitOps, put simply, makes disciplined process easier:

  • Peer Review: Instead of meetings, advance reading, some kind of Microsoft Office document versioning and comments, a git pull request is fundamentally better in every way, and easier too. GitHub even has a mobile app to make peer review as frictionless as possible
  • Testing: Testing is normally a manual process in infrastructure if performed at all. Git tools like GitHub and Bitbucket support in-line reporting, meaning that tests not only cost zero effort, but the results are automatically added to your pull requests!
  • Sleep all night: It's really easy to set up a 24-hour pipeline release schedule, so that roll to production could happen at 3 AM with no engineers awake unless there's a problem

To summarize, I just provided a tool-oriented example, but the discipline and process are what matter. The same process would apply to:

  • Bamboo and Ansible
  • Harness and Nornir
The only thing missing is more systems with declarative APIs.

Sunday, January 2, 2022

Leverage Idempotent, Declarative Profiles with the NSX-ALB (Avi) REST API

 Idempotence and Declarative Methods - not just buzzwords

Idempotence

Coined by Benjamin Peirce, this term describes a mathematical operation that produces the same result whether it is applied once or repeated many times.

Idempotence is a much more complicated subject in mathematics and computer science. IT and DevOps use a simplified version of the concept, commonly leveraging flow logic instead of Masters-level algebra.

Typically, an idempotent function in DevOps-land adds a few other requirements to the mix:

  • If a change is introduced, convergence (the act of making an object match what the consumer asked for) should be non-invasive and safe
    • It's the responsibility of the consumer to adequately test this
  •  Provide a "What If?" function of some kind, indicating how far off from desired state a system is
    • It's the responsibility of the consumer to adequately test this. Idempotent systems should provide a method for indicating what will change, but won't always provide a statement of impact

Ansible's modules are a good example of idempotent functions, but Ansible doesn't require that everything be idempotent. Some good examples exist of methods that cannot be idempotent, which is why DevOps usage effectively redefines the term to add the "do no harm" requirement:

  • Restarting a service
  • Deleting and re-adding a file

As a result, many contributed modules are not pressured to be idempotent when they should be. It's the responsibility of the consumer (probably you) to verify things don't cause harmful change.
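
In code, the DevOps flavor of idempotence usually boils down to "read, compare, and only write when needed," plus a dry-run mode to satisfy the "What If?" requirement. Here's a minimal sketch; the read/apply callables are placeholders rather than any particular vendor's API:

# Minimal idempotence sketch: converge only when observed state drifts from desired state
def converge(desired, read_state, apply_state, check_mode=False):
    """Return True if a change was (or would be) made."""
    current = read_state()
    if current == desired:
        # Repeated runs are safe - nothing to do, nothing changes
        return False
    if check_mode:
        # "What If?" mode: report the drift without touching anything
        print(f"Would change: {current} -> {desired}")
        return True
    apply_state(desired)
    return True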

Declarative Methods

Lori MacVittie (F5 Networks) provides an excellent detailed explanation of Declarative Models here:

https://www.f5.com/company/blog/why-is-a-declarative-model-important-for-netops-automation

Declarative Methods provide a system interface that can be leveraged by a non-Expert, by allowing a consumer to specify what the consumer wants instead of how to build it (an Imperative method).

This is a huge issue in the IT industry in general, because we (incorrectly) conflate rote memorization of individual imperative methods with capability. In the future, the IT industry will be forced to transform away from this highly negative cultural pattern.

We as professionals need to solve two major problems to assist in this transition:

  • Find a way to somehow teach fundamental concepts without imperative methods
  • Teach others to value the ability to effectively define what they desire in a complete and comprehensive way

If you've ever been frustrated by an IT support ticket that has some specific steps and a completely vague definition of success, declarative methods are for you. The single most important aspect of declarative methods is that  the user/consumer's intent is captured in a complete and comprehensive way. If a user fails to define their intent in modern systems like Kubernetes, the service will fail to build. In my experience, problem #1 feeds into problem #2, and some people just think they're being helpful by requesting imperative things.

Obviously the IT industry won't accept that a computer system is allowed to deny them if they failed to define everything they need to. This is where expertise comes in.

How we can use it in DevOps

Here's the good news - designing and building systems that provide idempotent, declarative methods of cyclical convergence isn't really an enterprise engineer's responsibility. Network Equipment Providers (NEPs) and systems vendors like VMware are on the hook for that part. We can interact with the provided functions leveraging some relatively simple flow logic:

Well-designed APIs (NSX ALB and NSX-T Data Center are good examples) provide a declarative method, ideally versioned (minor gripe with NSX ALB here, the message body contains the version and may be vestigial), and all we have to do is execute and test.

In a previous post, I covered that implementing reliability is the consumer's responsibility, transforming a systems engineer's role into one of testing, ensuring quality, and aligning vision as opposed to taking on all of the complex methods ourselves.

TL;DR Example Time, managing Application Profiles as Code (IaC)

Let's start by preparing NSX ALB (Avi) for API access. The REST client I'm using relies on HTTP Basic Authentication, so it must be enabled - the following setting is under System -> Settings -> Access Settings:

Note: In a production deployment other methods like JWT ought to be used.

The best place to begin here with a given target is to consult the API documentation, provided here: https://avinetworks.com/docs/21.1/api-guide/ApplicationProfile/index.html
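
As a quick sanity check that Basic Authentication is working, a raw requests call against the application profile collection looks something like this - the controller hostname and the pinned API version are assumptions for my lab:

# Sanity check: list application profiles over HTTP Basic Authentication
import os
import requests

controller = os.getenv("AVI_CONTROLLER", "avi.lab.example.net")  # placeholder hostname
resp = requests.get(
    f"https://{controller}/api/applicationprofile",
    auth=(os.getenv("APIUSER"), os.getenv("APIPASS")),
    headers={"X-Avi-Version": "21.1.1"},  # pin to your controller's API version
    verify=False,  # lab only
)
resp.raise_for_status()
for profile in resp.json()["results"]:
    print(profile["name"], profile["uuid"])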

When reviewing the documentation VMware provides, all of the CRUD methods (GET, PUT, PATCH, DELETE) are exposed for an individual application profile. Let's implement the workflow above as code (Python 3):


# JSON parsing and deep-compare libraries used below
import json

import deepdiff

# Note: 'cogitation_interface' is the restify Reliquary client instantiated earlier in the script


# Recursively converge application profiles
def converge_app_profile(app_profile_dict):
    # First, grab a copy of the existing application profile
    before_app_profile = json.loads(
        cogitation_interface.namshub(
            "get_app_profile", namshub_variables={"id": app_profile_dict["uuid"]}
        )
    )

    # Fastest and cheapest compare operation first
    if not app_profile_dict["profile"] == before_app_profile:
        # Build a deep difference of the two dictionaries, removing attributes that are not part of the profile, but the API generates
        diff_app_profile = deepdiff.DeepDiff(
            before_app_profile,
            app_profile_dict["profile"],
            exclude_paths=[
                "root['uuid']",
                "root['url']",
                "root['_last_modified']",
                "root['tenant_ref']",
            ],
        )

        # If there are differences, try to fix them at least 3 times
        if len(diff_app_profile) > 0 and app_profile_dict["retries"] < 3:
            print("Difference between dictionaries found: " + str(diff_app_profile))
            print(
                "Converging "
                + app_profile_dict["profile"]["name"]
                + " attempt # "
                + str(app_profile_dict["retries"] + 1)
            )
            # Increment retry counter
            app_profile_dict["retries"] += 1
            # Then perform Update verb on profile
            cogitation_interface.namshub(
                "update_app_profile",
                namshub_payload=app_profile_dict["profile"],
                namshub_variables={"id": app_profile_dict["uuid"]},
            )
            # Perform recursion
            converge_app_profile(app_profile_dict)
        else:
            return before_app_profile

Idempotency is easy to achieve: we leverage the deepdiff library to compare the data returned by a READ action, then re-apply the profile if it doesn't match. This method lets me mash the execute key until I feel good with the results. I've included a retry counter as well to prevent endless recursion.

That's actually all there is to it - this method can be combined with Semantically Versioned Profiles. I have provided public examples on how to execute that in the source code: https://github.com/ngschmidt/python-restify/tree/main/nsx-alb

Sunday, December 26, 2021

The winds of change in cloud operations, and why integrations like NSX Data Center 3.2 + Advanced Load Balancer are important

The Jetstreams

Cloud operators now provide two completely different classes of service to customers:

  • Self-Service, VMs, Operating System Templates
    • Generally mature; some private cloud operators are still smoothing out CMPs and the like, but these services work as intended from a customer perspective
    • Bringup is automated
    • Operating System level configuration is usually automated
    • Application-level configuration is not always automated or managed as code
    • Cloud Provider typically will hold responsibility for a widget working
  • Containers, Service Definitions, no GUI
    • Kubernetes fits squarely here, but other services exist
    • Not the most customer-friendly; still nascent
    • Application Owner has to hold responsibility for a widget working

Infrastructure Engineers as Agents of Change

The industry cannot transition responsibility for "stuff working" to creative types (Web Developers, App Developers). Have you ever heard "it's the network"? How about "This must be a <issue I'm not responsible for>"?

This is a call for help. Once the current wave of automation and reliability engineering work slows down (Type 1 above), the second kind of automation is going to necessitate leveraging infrastructure expertise elsewhere. Services like Kubernetes still require a "distribution" of sorts, but there's nobody to blame when something fails to deploy.

Why This Matters

NSX-T's 3.2 release has provided a ton of goodies, with an emphasis on centralized management and provisioning. We're starting to see tools that will potentially support multiple inbound declarative interfaces to achieve similar types of work, and NSX Data Center Manager has all the right moving parts to provide that.

NSX ALB's Controller interface provides comprehensive self-service and troubleshooting information, and a "Lite" service portal.

NSX Data Center + ALB presents a really unique value set, with one provisioning point for all services, in addition to the previously provided Layer 3 fabric integration. It's good to see this type of write-many implementation.

Let's try it out!

First, let's cover some prerequisites (detailed list here: https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.2/installation/GUID-3D1F107D-15C0-423B-8E79-68498B757779.html):

  • Greenfield Deployment only. This doesn't allow you to pick up existing NSX ALB installations
  • NSX ALB version 20.1.7+ or 21.1.2+ OVA
  • NSX Data Center 3.2.0+
  • NSX ALB and NSX Data Center controllers must exist on the same subnet
  • NSX Data Center and NSX ALB clusters should have a vIP configured
  • ...it also can't support third party CAs

Once these are met, the usual deployment prerequisites still apply.

Deployment is extremely straightforward, and managed under System -> Appliances. This wizard will require you to upload the OVA, so get that rolling before filling out any forms:


The NSX Manager will take care of the VM deployment for you. Interestingly enough, this may eventually let us get rid of tools like pyVmomi and deploy everything with Ansible/Terraform someday.

Once it's done deploying the first appliance, it'll report a "Degraded" state until 3 controllers are deployed.

Once installed, the NSX ALB objects should appear under Networking -> Advanced Load Balancer:

At this point, NSX Datacenter -> NSX ALB is integrated, but not ALB -> Data Center. The next step is to configure an NSX-T cloud. I've covered the procedure for configuring an NSX-T cloud here: https://blog.engyak.co/2021/09/vmware-nsx-alb-avi-networks-and-nsx-t.html

Note: Using a non-default CA certificate for NSX ALB here will break the integration. It can be fixed by reverting the certificate, and there doesn't appear to be an obvious way to change that yet. A Principal Identity is created for the connection between the systems, which suggests the feature just isn't fully exposed to users yet.

Viewing Services

A cursory review of the new ALB section indicates that existing vIPs don't appear via the ALB GUI, but the inverse is true. Let's try and build one for Jenkins! The constructs are essentially the same as with the ALB UI, but the process is considerably simpler:
 
First, create a vIP
Then, create the pool:

Finally, we will create the virtual service. Note: nullPointerException seems to mean that the SE Group is incorrect, and may need to be manually resolved on the ALB controller.

Unlike most VMware products, NSX Data Center seems to handle multi-write (changes from BOTH the ALB and the Manager) fairly well.

That's it!

Footnote: To use custom TLS profiles, it must be invoked via the API only. I am building a method to manage that here: https://github.com/ngschmidt/python-restify/tree/main/nsx-t/profiles/tls


Wednesday, December 22, 2021

NSX-T 3.2 and NSX ALB (Avi) Deployment Error - "Controller is not reachable. {0}"

NSX-T 3.2 has been released, and has a ton of spiffy features. The NSX ALB integration is particularly neat, but while repeatedly (repeatably) breaking the integration to learn more about it, I ran into this error:

When deploying NSX ALB appliances from the NSX Manager, it's very important to keep the NSX ALB Controller appliances where NSX Manager can see them. In addition, the appliances must exist on the same Layer 2 segment.

This post is not about the integration, however.

The following error:

NSX Advanced Load Balancer Controller is not reachable {0}

indicates that NSX-T has orphaned appliances. NSX-T has API invocations for cleaning this up, but no GUI integration. This is similar to other objects - cleanup like this should be done programmatically, where every step can be checked, to keep the work reliable.

To fix this, we must perform the following steps:

  • Get the list of NSX ALB appliances; if there aren't any, exit
  • Iterate through the list of appliances, prompting the user to delete
  • After deleting, check to make sure that it was deleted

The first step for any API invocations should be consulting the documentation. The NSX ALB Appliance management section is 3.7.1.4. After researching the procedure, I found the following endpoints:

Performing this procedure with programmatic interfaces is a good example of when to use APIs - the task is well defined, the results are easy to test, and work to prevent user mistakes is rewarding.
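
Here's a rough sketch of that flow in Python. The endpoint path is my recollection of the ALB controller deployment API and should be treated as a placeholder - verify it against section 3.7.1.4 before running anything:

# Sketch of the cleanup flow: list, confirm, delete, verify
import os
import requests

nsx = os.getenv("NSX_MANAGER", "nsx.lab.example.net")  # placeholder hostname
auth = (os.getenv("APIUSER"), os.getenv("APIPASS"))
base = f"https://{nsx}/api/v1/alb/controller-nodes/deployments"  # assumed path - verify in the docs

# Step 1: get the list of NSX ALB appliance deployments; exit if there aren't any
deployments = requests.get(base, auth=auth, verify=False).json().get("results", [])
if not deployments:
    raise SystemExit("No NSX ALB appliance deployments found - nothing to clean up")

# Step 2: iterate through the appliances, prompting the user before each delete
for node in deployments:
    node_id = node["id"]
    if input(f"Delete orphaned appliance {node_id}? (yes/no) ") != "yes":
        continue
    requests.post(f"{base}/{node_id}?action=delete", auth=auth, verify=False)

    # Step 3: check that the appliance was actually deleted
    remaining = requests.get(base, auth=auth, verify=False).json().get("results", [])
    if any(n["id"] == node_id for n in remaining):
        print(f"{node_id} is still present - check the manager logs")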

TL;DR - I wrote the code here, integrating it with the REST client: https://github.com/ngschmidt/python-restify/blob/main/nsx-t/nsxalb_deployment_cleanup.py

Saturday, December 4, 2021

VyOS and other Linux builds unable to use `vmxnet3` or "VMware Paravirtual SCSI" adapter on vSphere

Have you seen this selector when building machines on vSphere?

This causes some fairly common issues in NOS VMs, as most VI admins don't really know what distribution a NOS is based on.

"Guest OS Version" doesn't just categorize your workload, though. Selecting "Other Linux" instructs vSphere to maximize compatibility and ensure the VI admin receives a reliable deployment - which means it'll run some pretty old virtual hardware.

VMware curates its lifecycle "Guest OS" settings here. Note that "Other" isn't described: https://partnerweb.vmware.com/GOSIG/home.html

Two commonly preferred virtual hardware settings (VMware Paravirtual SCSI and vmxnet3) aren't available with this particular OS setting - you're left with the following instead, and both can cause performance issues:
  • LSI Logic Virtual SCSI
  • Intel E1000 NIC <---If you're wondering, it will drop your VM's throughput
Let's cover how we'd fix this in vSphere 7 with a VM. The example in this procedure is VyOS 1.4.

Updating Paravirtualized Hardware


First, let's change the Guest OS version to something more specific. Generally, Linux distributions fall under two categories - Red Hat and Debian derivatives - and Gentoo/Arch users won't be covered here because they should be able to find their own way out.

Since we know VyOS is a well-maintained distribution, I'll change it to "Debian 11." While this is technically lying, we're trying to provide a reference hardware version to the virtual machine, not accurately represent the workload. This menu can be reached by selecting "edit VM" on the vSphere console:

Second, let's change the SCSI Adapter:

Replacing network adapters will take a little more work. Re-typing existing interfaces is not currently supported in vSphere 7, so we'll need to delete and re-create them. In this example, we could set the MAC Address field to static so that the guest distribution correlates the new adapter to the same interface. Since I'm lifecycling a VM template, I don't want to do that!

If you're editing an existing VM, make a backup. If it's a NOS, export the configuration. There is no guarantee that the configurations will port over perfectly, and you will want a restore point. Fortunately, lots of options exist in the VMware ecosystem to handle this!

Refactoring / Recovering from the change

With my template VM, the only issues presented were that the interface re-numbered and the VRF needed to be re-assigned:

set interfaces ethernet eth2 address dhcp
set interfaces ethernet eth2 vrf mgmt
commit
save

Since we have the VM awake and in non-template-form, we can update the NOS too. (Guide here: https://docs.vyos.io/en/latest/installation/update.html)
vyos@vyos:~$ add system image https://downloads.vyos.io/rolling/current/amd64/vyos-rolling-latest.iso vrf mgmt
Trying to fetch ISO file from https://downloads.vyos.io/rolling/current/amd64/vyos-rolling-latest.iso
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 436M 100 436M 0 0 12.0M 0 0:00:36 0:00:36 --:--:-- 11.7M
ISO download succeeded.
Checking SHA256 (256-bit) checksum...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 106 100 106 0 0 215 0 --:--:-- --:--:-- --:--:-- 215
Found it. Verifying checksum...
SHA256 checksum valid.
Checking for digital signature file...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (22) The requested URL returned error: 404
Unable to fetch digital signature file.
Do you want to continue without signature check? (yes/no) [yes]
Checking MD5 checksums of files on the ISO image...OK.
Done!
What would you like to name this image? [1.4-rolling-202112040649]:
OK. This image will be named: 1.4-rolling-202112040649
Installing "1.4-rolling-202112040649" image.
Copying new release files...
Would you like to save the current configuration
directory and config file? (Yes/No) [Yes]: Yes
Copying current configuration...
Would you like to save the SSH host keys from your
current configuration? (Yes/No) [Yes]: No
Running post-install script...
Setting up grub configuration...
Done.
vyos@vyos:~$ show system image
The system currently has the following image(s) installed:

1: 1.4-rolling-202112040649 (default boot)
2: 1.4-rolling-202103130218 (running image)

vyos@vyos:~$ reboot
Are you sure you want to reboot this system? [y/N] y

Recap

To cover the major points of this article:
  • Selecting "Guest OS" in vSphere can present significant performance improvements or problems depending on what you choose. The selector chooses what PV hardware to provide to a VM, and it'll try to preserve compatibility and be conservative.
  • VM Hardware is a separate knob entirely - updating it won't make the newer paravirtual hardware available without also updating the "Guest OS" selector
  • Consult your NOS vendor on what to select here, if applicable, and require them to provide you documentation on why.
Some additional tangential benefits are present as a result of this change. For example, VM power actions work:

Since we're done, let's check this change into the image library:
Reference: https://blog.engyak.co/2020/10/using-vm-templates-and-nsx-t-for.html

Wednesday, November 24, 2021

Why Automate? Reliability Approaches with the VMware NSX-T API

Why should an infrastructure engineer leverage REST APIs?

I'm sure most IT workers have at least heard of REST APIs, or heard a sales pitch where a vendor insists that while a requested functionality doesn't exist, you could build it yourself by "using the API".

Or participated in discussions where people seemed to try and offer you a copy of The DevOps Handbook or The Unicorn Project.

They're right, but software development and deployment methods have completely different guiding values than infrastructure management. Speed of delivery is almost completely worthless with infrastructure, where downtime is typically the only metric that infrastructure is evaluated on.

We need to transform the industry.

The industry has proven that "value-added infrastructure" is something people want - otherwise, services like Amazon AWS, Azure, and Lumen wouldn't be profitable. Our biggest barrier to success right now is the perception around reliability, because there clearly is demand for what we'd call abstraction of infrastructure. We can't move as slowly as we used to, but we can't make mistakes either.

Stuck between a rock and a hard place?

I have some good news - everybody's just figuring this out as they go, and you don't have to start by replacing all of your day-to-day tasks with Ansible playbooks. Let's use automation tools to ensure Quality First, Speed Second. Machines excel at comparison operators, allowing an infrastructure administrator to test every possible aspect of infrastructure when executing a change. Here are some examples where I've personally seen a need for automation:
  • Large-scale routing changes: if 1,000 routes successfully migrate, and a handful of routes fail, manual checks tend to depend overly (unfairly) on the operator to eyeball the entire lot
    • Check: Before and after routes, export a difference
    • Check: All dynamic routing peers, export a difference
    • Reverse the process if anything fails
  • Certificate renewals
    • Check: If certificate exists
    • Check: If the certificate was uploaded
    • Check: If the certificate has a valid CA chain
    • Check: If the certificate was successfully installed
    • Reverse the process if anything fails
  • Adding a new VLAN or VNI to a fabric
    • Check: VLAN Spanning-Tree topology, export a difference
    • Check: EVPN AFI Peers, export a difference
    • Check: MAC Address Table, export a difference
    • Reverse the process if anything fails
The neat thing about this capability is the configuration reversal - API calls are incredibly easy to process in common programming languages (particularly compared to expect) and take fractions of a second to run - so if a tested process (it's easy to test, too!) does fail, reversion is straightforward. Let's cover the REST methods before exploring the deeper stuff like gNMI or YANG.
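
To make the "export a difference, reverse on failure" idea concrete, here's a minimal sketch of the routing-change pattern. The fetch/apply/rollback callables are placeholders for whatever API your platform exposes:

# Sketch of the check -> change -> diff -> revert pattern for a routing change
def guarded_change(fetch_routes, apply_change, rollback, max_unexpected=0):
    before = set(fetch_routes())   # snapshot the routing table before the change
    apply_change()
    after = set(fetch_routes())    # snapshot it again afterward

    lost = before - after          # export a difference in both directions
    gained = after - before
    print(f"Routes lost: {sorted(lost)}")
    print(f"Routes gained: {sorted(gained)}")

    if len(lost) > max_unexpected:
        # Anything beyond the expected delta triggers the reversal
        rollback()
        raise RuntimeError(f"Reverted: {len(lost)} routes disappeared unexpectedly")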

Anatomy of a REST call

When implementing a REST API call, a client request will have several key components:
  • Headers: Important meta-data about your request goes here; the server should adhere to any specification provided in the HTTP headers. If you're building API code or otherwise, I'd recommend setting up a standard after reviewing the list of supported fields. Examples:
    • Authentication Attributes
    • {'content-type': 'application/xml'}
    • {'content-type': 'application/json'}
    • {'Accept': 'application/json'}
    • {'Cache-Control': 'no-cache'}
  • Resource: This is specified by the Uniform Resource Identifier (URI), the URL component after the system is specified. A resource is the "what" of a RESTful interaction.
  • Body: Free-form optional text, this component provides a payload for the API call. It's important to make sure that the server actually wants it!
  • Web Application Firewalls (WAF) can inspect header, verb, and body to determine if an API call is safe and proper. 
When implementing a REST API call, a server response will have several key components:
  • Headers: Identical use case, but keep in mind that headers from server to client will be following a different list.
  • Response Code: This should provide detail on the status of the API call.
    • In network automation, I strongly discourage simply trusting the response code as a means of testing for changes. It's better to make multiple GET requests to verify that the change was executed and provided the intended effects - see the sketch after this list.
    • If implementing API-specific code, vendors will provide what each error code means specifically to them. Python supports constructing a dictionary with numeric indexes, a useful mechanism for mapping the vendor list, ex:
      • httperrors = {
        1: ('Unknown Command', 'The specific config or operational command is not recognized.'),
        2: ('Internal Error', 'Check with technical support when seeing these errors.'),
        3: ('Internal Error', 'Check with technical support when seeing these errors.'),
        4: ('Internal Error', 'Check with technical support when seeing these errors.'),
        5: ('Internal Error', 'Check with technical support when seeing these errors.')
        }
  • Body: Ideally used for any additional detail on why the call completed with the status it did, but not mandatory.
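
Putting the response-code guidance and the "verify with a GET" advice together, the pattern looks something like this - the URL, payload, and field names are stand-ins rather than any specific vendor's API:

# Don't trust the status code alone - read the object back and compare it to intent
import requests

def change_and_verify(url, desired, auth):
    resp = requests.put(url, json=desired, auth=auth, verify=False)  # lab only
    resp.raise_for_status()  # coarse pass/fail from the server

    # The real test: GET the object and compare the fields we tried to set
    observed = requests.get(url, auth=auth, verify=False).json()
    mismatched = {k: v for k, v in desired.items() if observed.get(k) != v}
    if mismatched:
        raise RuntimeError(f"Server accepted the call, but state differs: {mismatched}")
    return observed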

    Verb

    In a REST API, it's important to specify the TYPE of change you intend to make prior to actually invoking it. F5 Administrators will be familiar with this, with actions like tmsh create. We have 4 major REST verbs:
    • Create
    • Read
    • Modify/Update
    • Delete
    When you use a particular transport, you need to implement these verbs in a method native to that transport. This is significant when using other remote command methods like SSH (tmsh does this), NETCONF, or RESTCONF, all of which need a different implementation.

    Fortunately for us, HTTP 1.1 seems like it's been made for this! HTTP has plenty of verbs that match the above, here's a brief decoder ring.
    • GET: READ-only request, typically does not include a message body. 
      • This will normally use a URI to specify what details you want to grab.
      • Since you're "getting" information here, typically you'd want to JSON pretty-print the output
    • POST: CREATE request, if you're making a new object on a remote system a message body is typically required and POST conveniently supports that.
      • POST should not overwrite existing data, but REST implementations vary!
    • POST: READ request, occasionally used when a query requires a message body. 
      • URIs don't always cut it when it comes to remote filtered requests or complex multi-tier queries.
      • Cisco NX-API avoids GET as a READ verb, and primarily uses POST instead with the REST verbs in the body
    • PUT: UPDATE request, is idempotent. Generally does not contain a lot of change safety, as it will implement or fully replace an object. 
      • Situations definitely exist that you want to be idempotent, and this is the verb for that.
      • Doesn't require a body
    • PATCH: MODIFY request, will only modify an existing object.
      • This will take considerably more work to structure, as PATCH can optionally be safely executed, but the responsibility for assembling requests safely in this manner is on the developer.
      • Most API implementations simply use POST instead and implement change safety in the back-end.
    • DELETE: DELETE request, does exactly what it sounds like, it makes a resource disappear.
    Nota Bene: None of this is a mandatory convention, so vendors may implement deviations from the REST spec. For example, Palo Alto will use XML and 0-100 series HTTP codes.

    Executing a REST Call

    Once the rules are set, the execution of a REST call is extremely easy, here's an example:
    curl -k -u admin https://nsx.lab.engyak.net/api/v1/alarms
    Enter host password for user 'admin':
    {
    "results" : [ {
    "id" : "3e79618a-c89e-477b-8872-f4c87120585b",
    "feature_name" : "certificates",
    "event_type" : "certificate_expiration_approaching",
    "feature_display_name" : "Certificates",
    "event_type_display_name" : "Certificate Expiration Approaching",
    "summary" : "A certificate is approaching expiration.",
    "description" : "Certificate 5c9565d8-2cfa-4a28-86cc-e095acba5ba2 is approaching expiration.",
    "recommended_action" : "Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. For example, to apply a new certificate to the HTTP service, invoke the
    following NSX API POST /api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id> where <cert-id> is the ID of a valid certificate reported by the GET /api/v1/trust-management/certificates NS
    X API. Once the expiring certificate is no longer in use, it should be deleted by invoking the DELETE /api/v1/trust-management/certificates/5c9565d8-2cfa-4a28-86cc-e095acba5ba2 NSX API.",
    "node_id" : "37e90542-f8b8-136e-59bc-5dd3b79b122b",
    "node_resource_type" : "ClusterNodeConfig",
    "entity_id" : "5c9565d8-2cfa-4a28-86cc-e095acba5ba2",
    "last_reported_time" : 1637510695463,
    "status" : "OPEN",
    "severity" : "MEDIUM",
    "node_display_name" : "nsx",
    "node_ip_addresses" : [ "10.66.0.204" ],
    "reoccurrences_while_suppressed" : 0,
    "entity_resource_type" : "certificate_self_signed",
    "alarm_source_type" : "ENTITY_ID",
    "alarm_source" : [ "5c9565d8-2cfa-4a28-86cc-e095acba5ba2" ],
    "resource_type" : "Alarm",
    "display_name" : "3e79618a-c89e-477b-8872-f4c87120585b",
    "_create_user" : "system",
    "_create_time" : 1635035211215,
    "_last_modified_user" : "system",
    "_last_modified_time" : 1637510695464,
    "_system_owned" : false,
    "_protection" : "NOT_PROTECTED",
    "_revision" : 353
    }
    Now - saving these cURL commands by hand can be very administratively intensive - so I recommend some method of saving and automating custom API calls. Quite a few of the more complex calls will require JSON payloads, variables, and the like.

    Executing a Procedure

    Planning the Procedure

    Here we'll use the API to resolve the certificate expiration alarm above. I'm going to use my own REST client, found here, because it's familiar. Let's write the desired result in pseudo-code first to develop a plan:

    • GET current cluster certificate ID
    • GET certificate store
    • PUT a replacement certificate with a new name
    • GET certificate store (validate PUT)
    • GET certificate ID (to further validate PUT). For idempotency, multiple runs should be supported.
    • POST update cluster certificate
    • GET current cluster certificate ID
    This process seems tedious, but computers don't ever get bored, and the objective here is to be more thorough than is reasonably feasible with manual execution! If you're thinking, "Gee, this is an awful lot of work!" trick rocks into doing it for you.

    Let's Trick Those Rocks

    Some general guidelines when scripting API calls:
    • Use a familiar language. An infrastructure engineer's goal with automation is reliability. Hiring trends, hipster cred, don't matter here. If you do best with a slide rule, use that.
    • Use libraries. An infrastructure engineer's goal with automation is reliability. Leverage libraries with publicly available testing results.
    • Log and Report: An infrastructure engineer's goal with automation is reliability. Report every little thing your code does to your infrastructure, and test code thoroughly.
    In this case, I published a wrapper for Python requests that allows me to save API settings here, and built a script on that library. Install it first:
    python3 -m pip install restify-ENGYAK

    From here, it's important to research the API calls required for this procedure (good thing we have the steps!). For NSX-T, the API Documentation is available here: https://developer.vmware.com/apis/1163/nsx-t

    NSX-T's Certificate management API also has a couple of quirks, where the Web UI and the API leverage different certificates. It's outlined here: https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.1/administration/GUID-50C36862-A29D-48FA-8CE7-697E64E10E37.html

    Since we're writing code for reliability

    I'd like to outline a rough idea of where my time investment was for this procedure. I hope it helps because the focus really isn't on writing code.
    • 50%: Testing and planning testing. I used Jenkins CI for this, and I'm not the most capable with it. This effort reduces over time, but does not reduce importance! Write your test cases before everything!
    • 30%: Research. Consulting the VMware API docs and official documentation was worth every yoctosecond - avoiding potential problems with planned work is critical (and there were some major caveats with the API implementation).
    • 10%: Updating the parent library, setting up the python environment. Most of this work is 100% re-usable.
    • 5%: Managing source code, Git branching, basically generating a bread-crumb trail for the implementation for when I don't remember it.
    • 5%: Actually writing code!
    I'm saving useful API examples in my public repository: https://github.com/ngschmidt/python-restify

    The Code

    # JSON Parsing tool
    import json
    
    # Import Restify Library
    from restify.RuminatingCogitation import Reliquary
    
    # Import OS - let's use this for passwords and usernames
    # APIUSER = Username
    # APIPASS = Password
    import os
    
    api_user = os.getenv("APIUSER")
    api_pass = os.getenv("APIPASS")
    
    # Set the interface - apply from variables no matter what
    cogitation_interface = Reliquary(
        "settings.json", input_user=api_user, input_pass=api_pass
    )
    
    # Build Results Dictionary
    stack = {
        "old_cluster_certificate_id": False,
        "old_certificate_list": [],
        "upload_result": False,
        "new_certificate_id": False,
        "new_certificate_list": [],
        "new_cluster_certificate_id": False,
    }
    
    # GET current cluster certificate ID
    stack["old_cluster_certificate_id"] = json.loads(
        cogitation_interface.namshub("get_cluster_certificate_id")
    )["certificate_id"]
    
    # GET certificate store
    for i in json.loads(cogitation_interface.namshub("get_cluster_certificates"))[
        "results"
    ]:
        stack["old_certificate_list"].append(i["id"])
    # We need to compare lists, so let's sort it first
    stack["old_certificate_list"].sort()
    
    # PUT a replacement certificate with a new name
    print(cogitation_interface.namshub("put_certificate", namshub_variables="cert.json"))
    
    # GET certificate store (validate PUT)
    for i in json.loads(cogitation_interface.namshub("get_cluster_certificates"))[
        "results"
    ]:
        stack["new_certificate_list"].append(i["id"])
    # We need to compare lists, so let's sort it first, then make it the difference between new and old
    stack["old_certificate_list"].sort()
    stack["new_certificate_list"] = list(
        set(stack["new_certificate_list"]) - set(stack["old_certificate_list"])
    )
    
    # Be Idempotent - this may be run multiple times, and should handle it accordingly.
    if len(stack["new_certificate_list"]) == 0:
        stack["new_certificate_id"] = input(
            "Change not detected! Please select a certificate to replace with: "
        )
    else:
        stack["new_certificate_id"] = stack["new_certificate_list"][0]
    
    # GET certificate ID (to further validate PUT)
    print(
        cogitation_interface.namshub(
            "get_cluster_certificate",
            namshub_variables=json.dumps({"id": stack["new_certificate_id"]}),
        )
    )
    # POST update cluster certificate
    print(
        cogitation_interface.namshub(
            "post_cluster_certificate",
            namshub_variables=json.dumps({"id": stack["new_certificate_id"]}),
        )
    )
    print(
        cogitation_interface.namshub(
            "post_webui_certificate",
            namshub_variables=json.dumps({"id": stack["new_certificate_id"]}),
        )
    )
    
    # GET current cluster certificate ID
    stack["new_cluster_certificate_id"] = json.loads(
        cogitation_interface.namshub("get_cluster_certificate_id")
    )["certificate_id"]
    
    # Show the results
    print(json.dumps(stack, indent=4))
    

    Sunday, October 10, 2021

    Get rid of certificate errors with Avi (NSX-ALB) and Hashicorp Vault!

     Have you ever seen this error before?

    This is a really important issue in enterprise infrastructure because unauthenticated TLS connections teach our end users to be complacent and ignore this error. 

    TLS Authentication

    SSL/TLS for internal enterprise administration typically only addresses the confidentiality aspects of an organizational need - yet the integrity aspects are not well realized:


    This is an important aspect of our sense of enterprise security, but the level of effort required to authenticate endpoints with TLS is high, so we make do with what we have.

    The practice of ignoring authentication errors for decades has promoted complacency

    Here's another error that enterprise systems administrators see all the time:

    ssh {{ ip }}
    The authenticity of host '{{ ip }} ({{ ip }})' can't be established.
    RSA key fingerprint is SHA256:{{ hash }}.
    Are you sure you want to continue connecting (yes/no)?

    This probably looks familiar too - Secure Shell (SSH) follows a different method of establishing trust, where the user should verify that hash is correct by some method, and if it changes, it'll throw an error that we hopefully don't ignore:

    ssh {{ ip }}
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
    Someone could be eavesdropping on you right now (man-in-the-middle attack)!
    It is also possible that a host key has just been changed.
    The fingerprint for the RSA key sent by the remote host is
    SHA256:{{ hash }}.
    Please contact your system administrator.
    Add correct host key in known_hosts to get rid of this message.
    Offending ECDSA key in known_hosts
    {{ cipher }} host key for {{ ip }} has changed and you have requested strict checking.
    Host key verification failed.

    SSH is performing something very valuable here - authentication. By default, SSH will record a node's SSH hash in a file called known_hosts to ensure that the server is in fact the same as the last time you accessed it. In turn, once the server authenticates, you provide some level of authentication (user, key) afterward to ensure that you are who you say you are too. Always ensure that the service you're giving a secret to (like your password!) is authenticated or validated in some way first!

    Web of Trust versus Centralized Identity

    Web-of-Trust (WoT)

    Web-of-Trust (WoT) is typically the easiest form of authentication scheme to start out with, but it results in quadratic scaling issues later on if executed properly - every peer has to validate every other peer. In this model, it's on the individual to validate identities from each peer they interact with, since WoT neither requires nor wants a centralized authority to validate against.

    Typically enterprises use WoT because it's baked into a product, not specifically due to any particular need. Certain applications work well with it - so generally you should:
    • Keep your circle small
    • Replace crypto regularly
    • Leverage multiple identities for multiple tasks
      • e.g. separate your code signing keys from your SSH authentication keys
    Pros
    • Easy initial set-up
    • Doesn't depend on a third party to establish infrastructure
    Cons
    • The user is empowered to make both good and bad decisions, and the vast majority of users don't care enough about security to maintain vigilance
    • If you're in an organization with hundreds of "things to validate", you have to personally validate a lot of keys
    • It's a lot of work to properly validate - Ex. You probably don't ask for a person's ID every time you share emails with them
    • Revocation: If a key is compromised, you're relying on every single user to revoke it (or renew it, change your crypto folks) in a timely manner. This is a lot of work depending on how much a key is used.
    Examples: SSH, PGP

    Centralized Identity

    Centralized Identity services are the sweetheart of large enterprises. Put your security officers in charge of one of these babies and they'll make it sing

    In this model, it's on the Identity Administrator to ensure the integrity of any Identity Store. They'll typically do quite a bit better than your average WoT user because it's their job to do so.

    Centralized Identity services handle routine changes like ID refreshes/revocations much more easily with dedicated staffing - mostly because the application and maintainer are easy to define. But here's the rub, you have to be able to afford it. Most of the products that fit in this category are not free and require at least part-time supervision by a capable administrator.

    It's not impossible, though. One can build centralized authentication mechanisms with open source tooling, it just takes work. If you aren't the person doing this work, you should help them by being a vigilant user - if an identity was compromised, report it quickly, even if it was your fault - the time to respond here is vital. Try to shoulder some of this weight whenever you can - it's an uphill hike for the people doing it and every little contribution counts.

    Back to TLS and Certificates

    In the case of an Application Delivery Administrator, an individual is responsible for the integrity and confidentiality of the services they deliver. This role must work hand-in-glove with Identity administrators in principle, and both are security administrators at heart.

    This is really just a flowery way to say "get used to renewing and filing Certificate Signing Requests (CSRs)".

    In an ideal world, an Application Delivery Controller (ADC) will validate the integrity of a backend server (Real Server) before passing traffic to it, in addition to providing the whole "CIA Triad" to clients. Availability is an ADC's thing, after all.

    Realistically an ADC Administrator will only control one of these two legs - and it's plenty on its own. Here's one way to execute this model.

    Certificate Management

    Enough theory, let's do some things. First, we'll build a PKI inside of HashiCorp Vault - this assumes a full Vault installation. Here's a view of the planned Certificate Hierarchy:


    From the HashiCorp Vault GUI - let's set up a PKI secrets engine for the root CA:



    Note: Default duration is 30 days, so I've overridden this by setting the default and max-lifetime under each CA labeled as "TTL"
    Let's create the services and user CAs:


    This will provide a CSR - we need to sign it under the root CA:

    Copy the resulting certificate into your clipboard - these secrets engines are autonomous, and don't interoperate - so we'll have to install it into the intermediate CA.
    We install the certificate via the "Set signed intermediate" button in Vault:
    Now, we have a hierarchical CA!
    NB: You will need to create a Vault "role" here - https://www.vaultproject.io/docs/secrets/pki 
    Mega NB: The root CA should nominally be "offline" and at a minimum part of a separate Vault instance!
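
The same hierarchy can also be stood up against Vault's HTTP API instead of the GUI. Here's a minimal sketch assuming mounts named root-ca and svc-ca and a token in VAULT_TOKEN - verify the paths against the PKI secrets engine documentation for your Vault version:

# Sketch: build the root/intermediate hierarchy via Vault's PKI HTTP API
import os
import requests

vault = os.getenv("VAULT_ADDR", "https://vault.lab.example.net:8200")  # placeholder address
headers = {"X-Vault-Token": os.getenv("VAULT_TOKEN")}

# Generate the root CA inside the root-ca mount (the TTL overrides the 30-day default)
requests.post(f"{vault}/v1/root-ca/root/generate/internal", headers=headers,
              json={"common_name": "Lab Root CA", "ttl": "87600h"}).raise_for_status()

# Generate an intermediate CSR inside the svc-ca mount
csr = requests.post(f"{vault}/v1/svc-ca/intermediate/generate/internal", headers=headers,
                    json={"common_name": "Lab Services CA"}).json()["data"]["csr"]

# Sign the CSR under the root, then hand the signed certificate back to the intermediate
signed = requests.post(f"{vault}/v1/root-ca/root/sign-intermediate", headers=headers,
                       json={"csr": csr, "format": "pem_bundle", "ttl": "43800h"}
                       ).json()["data"]["certificate"]
requests.post(f"{vault}/v1/svc-ca/intermediate/set-signed", headers=headers,
              json={"certificate": signed}).raise_for_status()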

    For this post, we'll just be issuing certificates manually. We need to extract the intermediate and root certificates to install in NSX ALB and participating clients. These can be pulled from the root-ca module:
    Note: Vault doesn't come with a certificate reader as of 1.8.3. You can read these certificates with online tools or by performing the following command with OpenSSL:
    openssl x509 -in cert1.crt -noout -text


    Once we have the files, let's upload them to Avi:
    For each certificate, click "Root/Intermediate CA Certificate" and Import. Note that you do need to click on Validate before importing.

    Now that we have the CA available, we should start by authenticating Avi itself and creating a controller certificate:
    Fulfilling the role of PKI Administrator, let's sign the CSR after verifying authenticity.
    Back to the role of Application Administrator! We've received the certificate, let's install it in the Avi GUI!
    Once we've verified the certificate is healthy, let's apply it to the management plane under Administration -> Settings -> Access Settings:
    At this point, we'll need to trust the root certificate created in Vault - else we'll still see certificate errors. Once that's done, we'll be bidirectionally authenticated with the Avi controller!

    From here on out - we'll be able to  leverage the same process, in short:
    • Under Avi -> Templates -> Security -> TLS/SSL Certificates, create a new Application CSR
      • Ensure that all appropriate Subject Alternative Names (SANs) are captured!
    • Under Vault -> svc-ca -> issued-certificates -> Sign Certificate (this can also be driven via the API - see the sketch after this list)
    • Copy issued certificate to TLS Certificate created in the previous step
    • Assign to a virtual service. Unlike F5 LTM, this is decoupled from the clientssl profile.
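
For reference, the signing step above can also be driven through the API once a role exists under svc-ca. A minimal sketch - the role name, file name, and common name are placeholders:

# Sketch: sign an application CSR under the svc-ca mount using a pre-created role
import os
import requests

vault = os.getenv("VAULT_ADDR", "https://vault.lab.example.net:8200")  # placeholder address
headers = {"X-Vault-Token": os.getenv("VAULT_TOKEN")}

# CSR exported from Avi's TLS/SSL Certificates page
with open("app.csr") as csr_file:
    csr = csr_file.read()

resp = requests.post(f"{vault}/v1/svc-ca/sign/web-services",  # 'web-services' role is assumed
                     headers=headers,
                     json={"csr": csr, "common_name": "app.lab.example.net"})
resp.raise_for_status()
# Paste the issued certificate back into the Avi certificate object
print(resp.json()["data"]["certificate"])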
