Sunday, December 26, 2021

The winds of change in cloud operations, and why integrations like NSX Data Center 3.2 + Advanced Load Balancer are important

The Jetstreams

Cloud operators now provide two completely different classes of service to customers:

  • Self-Service, VMs, Operating System Templates
    • Generally mature; some private cloud operators are still smoothing out their Cloud Management Platforms (CMPs) or similar tooling, but these services work as intended from a customer perspective
    • Bringup is automated
    • Operating System level configuration is usually automated
    • Application-level configuration is not always automated or managed as code
    • Cloud Provider typically will hold responsibility for a widget working
  • Containers, Service Definitions, no GUI
    • Kubernetes fits squarely here, but other services exist
    • Not the most customer friendly, nascent 
    • Application Owner has to hold responsibility for a widget working

Infrastructure Engineers as Agents of Change

The industry cannot transition responsibility for "stuff working" to creative types (Web Developers, App Developers). Have you ever heard "it's the network"? How about "This must be a <issue I'm not responsible for>"?

This is a call for help. Once the current trends with Automation and reliability engineering slow down (Type 1 above), the second kind of automation is going to necessitate leveraging infrastructure expertise elsewhere. Services like Kubernetes require a "distribution" of sorts, but there's nobody to blame when something fails to deploy.

Why This Matters

NSX-T's 3.2 release has provided a ton of goodies, with an emphasis on centralized management and provisioning. We're starting to see tools that will potentially support multiple inbound declarative interfaces to achieve similar types of work, and NSX Data Center Manager has all the right moving parts to provide that.

NSX ALB's Controller interface provides comprehensive self-service and troubleshooting information, and a "Lite" service portal.

NSX Datacenter + ALB presents a really unique value set, with one provisioning point for all services, in addition to the previously provided Layer 3 fabric integration. It's good to see this type of write-many implementation.

Let's try it out!

First, let's cover some prerequisites (detailed list here:

  • Greenfield Deployment only. This doesn't allow you to pick up existing NSX ALB installations
  • NSX ALB version 20.1.7+ or 21.1.2+ OVA
  • NSX Data Center 3.2.0+
  • NSX ALB and NSX Data Center controllers must exist on the same subnet
  • NSX Data Center and NSX ALB clusters should have a vIP configured
  • Third-party CA certificates on NSX ALB are not supported by the integration

Once these are met, the usual prerequisites also matter. 

Deployment is extremely straightforward, and managed under System -> Appliances. This wizard will require you to upload the OVA, so get that rolling before filling out any forms:

The NSX Manager will take care of the VM deployment for you. Interestingly enough, this will allow us to potentially get rid of tools like pyVmomi and let us deploy everything with Ansible/Terraform someday.

Once it's done deploying the first appliance, it'll report a "Degraded" state until 3 controllers are deployed.

Once installed, the NSX ALB objects should appear under Networking -> Advanced Load Balancer:

At this point, NSX Datacenter -> NSX ALB is integrated, but not ALB -> Data Center. The next step is to configure an NSX-T cloud. I've covered the procedure for configuring an NSX-T cloud here:

Note: Using a non-default CA Certificate for NSX ALB here will break the integration. Reverting the certificate restores it, but there doesn't appear to be an obvious way to use a custom certificate yet. A Principal Identity is formed for the connection between systems, indicating that the feature is just not fully exposed to users yet.

Viewing Services

A cursory review of the new ALB section indicates that existing vIPs don't appear via the ALB GUI, but the inverse is true. Let's try and build one for Jenkins! The constructs are essentially the same as with the ALB UI, but the process is considerably simpler:
First, create a vIP
Then, create the pool:

Finally, we will create the virtual service. Note: nullPointerException seems to mean that the SE Group is incorrect, and may need to be manually resolved on the ALB controller.

Unlike most VMware products, NSX Data Center seems to handle multi-write (changes from BOTH the ALB and the Manager) fairly well.

That's it!

Footnote: Custom TLS profiles can only be used via the API. I am building a method to manage that here:

Wednesday, December 22, 2021

NSX-T 3.2 and NSX ALB (Avi) Deployment Error - "Controller is not reachable. {0}"

NSX-T 3.2 has been released, and has a ton of spiffy features. The NSX ALB integration is particularly neat, but while repeatedly (repeatably) breaking the integration to learn more about it, I ran into this error:

When deploying NSX ALB appliances from the NSX Manager, it's very important to keep the NSX ALB Controller appliances where NSX Manager can see them. In addition, the appliances must exist on the same Layer 2 Segment.

This post is not about the integration, however.

The following error:

NSX Advanced Load Balancer Controller is not reachable {0}

indicates that NSX-T has orphaned appliances. NSX-T has API invocations for cleaning this up, but no GUI integration. This is similar to other objects, presumably because programmatic checks should be used to make this kind of work reliable.

To fix this, we must perform the following steps:

  • Get the list of NSX ALB appliances; if there aren't any, exit
  • Iterate through the list of appliances, prompting the user to delete
  • After deleting, check to make sure that it was deleted
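The three steps above can be sketched as a small loop. This is a minimal illustration, not the code linked in this post: the list, delete, and prompt operations are passed in as plain functions (all hypothetical names), so no particular NSX-T endpoint is assumed. In real use they would wrap the API calls found in the documentation.

```python
from typing import Callable, List


def clean_orphaned_appliances(
    list_appliances: Callable[[], List[str]],
    delete_appliance: Callable[[str], None],
    confirm: Callable[[str], bool],
) -> List[str]:
    """Prompt for, delete, and verify removal of each appliance.

    Returns the IDs that were confirmed deleted.
    """
    appliances = list_appliances()
    if not appliances:
        # Nothing to clean up, so exit early
        return []

    deleted = []
    for appliance_id in appliances:
        if not confirm(appliance_id):
            continue
        delete_appliance(appliance_id)
        # Don't trust the delete call alone: re-read the list and check
        if appliance_id in list_appliances():
            raise RuntimeError(f"Appliance {appliance_id} still present after delete")
        deleted.append(appliance_id)
    return deleted


# Usage with an in-memory stand-in for the API (hypothetical appliance names)
inventory = ["alb-controller-01", "alb-controller-02"]
removed = clean_orphaned_appliances(
    list_appliances=lambda: list(inventory),
    delete_appliance=inventory.remove,
    confirm=lambda appliance_id: appliance_id.endswith("01"),
)
print(removed)
```

Keeping the API calls behind plain functions also makes the loop easy to test without touching a live NSX Manager.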

The first step for any API invocation should be consulting the documentation, which has a section on NSX ALB Appliance management. After researching the procedure, I found the following endpoints:

Performing this procedure with programmatic interfaces is a good example of when to use APIs - the task is well defined, the results are easy to test, and work to prevent user mistakes is rewarding.

TL;DR - I wrote the code here, integrating it with the REST client:

Saturday, December 4, 2021

VyOS and other Linux builds unable to use `vmxnet3` or "VMware Paravirtual SCSI" adapter on vSphere

Have you seen this selector when building machines on vSphere?

This causes some fairly common issues in NOS VMs, as most administrators don't really know what distribution the NOS is based on.

"Guest OS Version" doesn't just categorize your workload, though. Selecting "Other Linux" instructs vSphere to maximize compatibility and ensure the VI admin receives a reliable deployment - which means it'll run some pretty old virtual hardware.

VMware curates its lifecycle "Guest OS" settings here. Note that "Other" isn't described:

Two commonly preferred settings for virtual hardware, the vmxnet3 NIC and the VMware Paravirtual SCSI adapter, aren't available with this particular OS setting. The VM gets the following instead, and they both cause potential performance issues:
  • LSI Logic Virtual SCSI
  • Intel E1000 NIC <---If you're wondering, it will drop your VM's throughput
Let's cover how we'd fix this in vSphere 7 with a VM. The example in this procedure is VyOS 1.4.

Updating Paravirtualized Hardware

First, let's change the Guest OS version to something more specific. Generally, Linux distributions fall under two categories, Red-Hat, and Debian derivatives - Gentoo/Arch users won't be covered here because they should be able to find their own way out.

Since we know VyOS is a well-maintained distribution, I'll change it to "Debian 11." While this is technically lying, we're trying to provide a reference hardware version to the virtual machine, not accurately represent the workload. This menu can be reached by selecting "edit VM" on the vSphere console:

Second, let's change the SCSI Adapter:

Replacing network adapters will take a little bit more work. Re-typing existing interfaces is not currently supported in vSphere 7, so we'll need to delete and re-create them. If you want the guest distribution to correlate the new adapter to the same interface, set the MAC Address field to static and re-use the old address. Since I'm life cycling a VM template, I don't want to do that!

If you're editing an existing VM, make a backup. If it's a NOS, export the configuration. There is no guarantee that the configurations will port over perfectly, and you will want a restore point. Fortunately, lots of options exist in the VMware ecosystem to handle this!

Refactoring / Recovering from the change

With my template VM, the only issues presented were that the interface re-numbered and the VRF needed to be re-assigned:

set interfaces ethernet eth2 address dhcp
set interfaces ethernet eth2 vrf mgmt

Since we have the VM awake and in non-template-form, we can update the NOS too. (Guide here:
vyos@vyos:~$ add system image vrf mgmt
Trying to fetch ISO file from
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 436M 100 436M 0 0 12.0M 0 0:00:36 0:00:36 --:--:-- 11.7M
ISO download succeeded.
Checking SHA256 (256-bit) checksum...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 106 100 106 0 0 215 0 --:--:-- --:--:-- --:--:-- 215
Found it. Verifying checksum...
SHA256 checksum valid.
Checking for digital signature file...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (22) The requested URL returned error: 404
Unable to fetch digital signature file.
Do you want to continue without signature check? (yes/no) [yes]
Checking MD5 checksums of files on the ISO image...OK.
What would you like to name this image? [1.4-rolling-202112040649]:
OK. This image will be named: 1.4-rolling-202112040649
Installing "1.4-rolling-202112040649" image.
Copying new release files...
Would you like to save the current configuration
directory and config file? (Yes/No) [Yes]: Yes
Copying current configuration...
Would you like to save the SSH host keys from your
current configuration? (Yes/No) [Yes]: No
Running post-install script...
Setting up grub configuration...
vyos@vyos:~$ show system image
The system currently has the following image(s) installed:

1: 1.4-rolling-202112040649 (default boot)
2: 1.4-rolling-202103130218 (running image)

vyos@vyos:~$ reboot
Are you sure you want to reboot this system? [y/N] y


To cover the major points of this article:
  • Selecting "Guest OS" in vSphere can present significant performance improvements or problems depending on what you choose. The selector chooses what PV hardware to provide to a VM, and it'll try to preserve compatibility and be conservative.
  • VM Hardware is a separate knob entirely. Updating it won't make the newer hardware available without the "Guest OS" selector
  • Consult your NOS vendor on what to select here, if applicable, and require them to provide you documentation on why.
Some additional tangential benefits are present as a result of this change. For example, VM power actions work:

Since we're done, let's check this change into the image library:

Wednesday, November 24, 2021

Why Automate? Reliability Approaches with the VMware NSX-T API

Why should an infrastructure engineer leverage REST APIs?

I'm sure most IT workers have at least heard of REST APIs, or heard a sales pitch where a vendor insists that while a requested functionality doesn't exist, you could build it yourself by "using the API".

Or maybe you've participated in discussions where people seemed to try and offer you a copy of The DevOps Handbook or The Unicorn Project.

They're right, but software development and deployment methods have completely different guiding values than infrastructure management. Speed of delivery is almost completely worthless with infrastructure, where downtime is typically the only metric that infrastructure is evaluated on.

We need to transform the industry.

The industry has proven that "value-added infrastructure" is a thing that people want, otherwise, services like Amazon AWS, Azure, Lumen would not be profitable. Our biggest barrier to success right now is the perceptions around reliability because there clearly is demand for what we'd call abstraction of infrastructure. We can't move as slow as we used to, but we can't make mistakes either.

Stuck between a rock and a hard place?

I have some good news - everybody's just figuring this out as they go, and you don't have to start by replacing all of your day-to-day tasks with Ansible playbooks. Let's use automation tools to ensure Quality First, Speed Second. Machines excel at comparison operators, allowing an infrastructure administrator to test every possible aspect of infrastructure when executing a change. Here are some examples where I've personally seen a need for automation:
  • Large-scale routing changes: if 1,000 routes successfully migrate, and a handful of routes fail, manual checks tend to depend overly (unfairly) on the operator to eyeball the entire lot
    • Check: Before and after routes, export a difference
    • Check: All dynamic routing peers, export a difference
    • Reverse the process if anything fails
  • Certificate renewals
    • Check: If certificate exists
    • Check: If the certificate was uploaded
    • Check: If the certificate has a valid CA chain
    • Check: If the certificate was successfully installed
    • Reverse the process if anything fails
  • Adding a new VLAN or VNI to a fabric
    • Check: VLAN Spanning-Tree topology, export a difference
    • Check: EVPN AFI Peers, export a difference
    • Check: MAC Address Table, export a difference
    • Reverse the process if anything fails
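As a minimal sketch of the "before and after, export a difference" checks above: the route lists here are hard-coded, made-up prefixes, where a real script would collect them from the device or controller API before and after the change.

```python
def diff_routes(before, after):
    """Compare two route snapshots and report what disappeared or appeared."""
    before_set, after_set = set(before), set(after)
    return {
        "missing": sorted(before_set - after_set),
        "added": sorted(after_set - before_set),
    }


# Hypothetical snapshots taken before and after a change
before = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
after = ["10.0.0.0/24", "10.0.2.0/24", "10.0.3.0/24"]

report = diff_routes(before, after)
# Treat any lost route as a failed change and a trigger to reverse it
change_ok = not report["missing"]
print(report, change_ok)
```

The same set comparison works for routing peers, MAC address tables, or certificate lists. Only the collection step changes.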
The neat thing about this capability is the configuration reversal - API calls are incredibly easy to process in common programming languages (particularly compared to expect) and take fractions of a second to run - so if a tested process (it's easy to test, too!) does fail, reversion is straightforward. Let's cover the REST methods before exploring the deeper stuff like gNMI or YANG.

Anatomy of a REST call

When implementing a REST API call, a client request will have several key components:
  • Headers: Important meta-data about your request go here, the server should adhere to any specification provided in HTTP headers. If you're building API code or otherwise, I'd recommend just setting up a standard when reviewing the list of supported fields. Examples:
    • Authentication Attributes
    • {'content-type': 'application/xml'}
    • {'content-type': 'application/json'}
    • {'Accept': 'application/json'}
    • {'Cache-Control': 'no-cache'}
  • Resource: This is specified by the Uniform Resource Identifier (URI), the URL component after the system is specified. A resource is the "what" of a RESTful interaction.
  • Body: Free-form optional text, this component provides a payload for the API call. It's important to make sure that the server actually wants it!
  • Web Application Firewalls (WAF) can inspect header, verb, and body to determine if an API call is safe and proper. 
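To make the request components concrete, here is a small sketch that builds (but does not send) two requests with Python's standard library. The host name and the second resource path are placeholders I made up for illustration. The certificates path is the one quoted in the NSX alarm text later in this post.

```python
import json
from urllib.request import Request

# Hypothetical manager address; replace with your own
BASE_URL = "https://nsx.example.com"

# A read request: verb, resource, and headers, with no body
request = Request(
    url=BASE_URL + "/api/v1/trust-management/certificates",  # Resource
    method="GET",  # Verb
    headers={
        "Accept": "application/json",  # Format we want back
        "Cache-Control": "no-cache",
    },
)

# A request that carries a body declares the body's format in Content-Type
payload = json.dumps({"display_name": "example"}).encode("utf-8")
request_with_body = Request(
    url=BASE_URL + "/api/example-resource",  # Hypothetical resource
    method="POST",
    headers={"Content-Type": "application/json"},
    data=payload,  # Body
)

print(request.get_method(), request.full_url)
print(request_with_body.get_method(), request_with_body.data)
```

Nothing is transmitted here. Passing either object to `urllib.request.urlopen` would actually send it.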
When implementing a REST API call, a server response will have several key components:
  • Headers: Identical use case, but keep in mind that headers from server to client will be following a different list.
  • Response Code: This should provide detail on the status of the API call.
    • In network automation, I strongly discourage simply trusting the response code as a means of testing for changes. It's better to make multiple GET requests to verify that the change was executed and provided the intended effects.
    • If implementing API-specific code, vendors will provide what each error code means specifically to them. Python supports constructing a dictionary with numeric indexes, a useful mechanism for mapping the vendor list, ex:
      • httperrors = {
        1: ('Unknown Command', 'The specific config or operational command is not recognized.'),
        2: ('Internal Error', 'Check with technical support when seeing these errors.'),
        3: ('Internal Error', 'Check with technical support when seeing these errors.'),
        4: ('Internal Error', 'Check with technical support when seeing these errors.'),
        5: ('Internal Error', 'Check with technical support when seeing these errors.'),
        }
  • Body: Ideally used for any additional detail on why the call returned the status it did, but not mandatory.
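A minimal sketch of how such a dictionary might be used. The two entries are copied from the example above, and the fallback text for unlisted codes is my own placeholder rather than anything a vendor publishes.

```python
# Vendor-specific codes mapped to (short name, guidance)
VENDOR_ERRORS = {
    1: ("Unknown Command", "The specific config or operational command is not recognized."),
    2: ("Internal Error", "Check with technical support when seeing these errors."),
}


def describe_error(code):
    """Translate a numeric vendor code into a readable log line."""
    name, detail = VENDOR_ERRORS.get(
        code, ("Unlisted Code", "Consult the vendor documentation.")
    )
    return f"{code}: {name} - {detail}"


print(describe_error(1))
print(describe_error(99))
```

Using `dict.get` with a default keeps the script from crashing on a code the vendor list doesn't cover, which is exactly when a clear log message matters most.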


    In a REST API, it's important to specify the TYPE of change you intend to make prior to actually invoking it. F5 Administrators will be familiar with this, with actions like tmsh create. We have 4 major REST verbs:
    • Create
    • Read
    • Modify/Update
    • Delete
    When you use a particular transport, you need to implement these verbs in a method native to that transport. This is significant when using other remote command methods like SSH (tmsh does this), NETCONF, or RESTCONF, all of which need a different method to implement.

    Fortunately for us, HTTP 1.1 seems like it's been made for this! HTTP has plenty of verbs that match the above, here's a brief decoder ring.
    • GET: READ-only request, typically does not include a message body. 
      • This will normally use a URI to specify what details you want to grab.
      • Since you're "getting" information here, typically you'd want to JSON pretty-print the output
    • POST: CREATE request, if you're making a new object on a remote system a message body is typically required and POST conveniently supports that.
      • POST should not overwrite existing data, but REST implementations vary!
    • POST: READ request, occasionally used when a query requires a message body. 
      • URIs don't always cut it when it comes to remote filtered requests or complex multi-tier queries.
      • Cisco NX-API avoids GET as a READ verb, and primarily uses POST instead with the REST verbs in the body
    • PUT: UPDATE request, is idempotent. Generally does not contain a lot of change safety, as it will implement or fully replace an object. 
      • Situations definitely exist that you want to be idempotent, and this is the verb for that.
      • Doesn't always require a body, depending on the implementation
    • PATCH: MODIFY request, will only modify an existing object.
      • This will take considerably more work to structure, as PATCH can optionally be safely executed, but the responsibility for assembling requests safely in this manner is on the developer.
      • Most API implementations simply use POST instead and implement change safety in the back-end.
    • DELETE: DELETE request, does exactly what it sounds like, it makes a resource disappear.
    Nota Bene: None of this is a mandatory convention, so vendors may implement deviations from the REST spec. For example, Palo Alto will use XML and 0-100 series HTTP codes.
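The difference between PUT and PATCH is easier to see with a toy example. This sketch models a server-side object store as a plain dictionary; the object name and fields are invented, and real APIs may deviate as noted above.

```python
# In-memory stand-in for objects held by a remote system
store = {}


def put(resource_id, representation):
    """PUT: create or fully replace the object (idempotent)."""
    store[resource_id] = dict(representation)


def patch(resource_id, changes):
    """PATCH: modify an existing object only; fail if it does not exist."""
    if resource_id not in store:
        raise KeyError(resource_id)
    store[resource_id].update(changes)


put("vip-1", {"address": "192.0.2.10", "port": 443})
put("vip-1", {"address": "192.0.2.10", "port": 443})  # Repeating changes nothing
patch("vip-1", {"port": 8443})  # Only the named field changes
after_patch = dict(store["vip-1"])

put("vip-1", {"address": "192.0.2.20"})  # Full replacement: port is gone
after_put = dict(store["vip-1"])
print(after_patch, after_put)
```

The last call is the "not a lot of change safety" case: anything left out of a PUT body is removed from the object.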

    Executing a REST Call

    Once the rules are set, the execution of a REST call is extremely easy, here's an example:
    curl -k -u admin
    Enter host password for user 'admin':
    "results" : [ {
    "id" : "3e79618a-c89e-477b-8872-f4c87120585b",
    "feature_name" : "certificates",
    "event_type" : "certificate_expiration_approaching",
    "feature_display_name" : "Certificates",
    "event_type_display_name" : "Certificate Expiration Approaching",
    "summary" : "A certificate is approaching expiration.",
    "description" : "Certificate 5c9565d8-2cfa-4a28-86cc-e095acba5ba2 is approaching expiration.",
    "recommended_action" : "Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. For example, to apply a new certificate to the HTTP service, invoke the
    following NSX API POST /api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id> where <cert-id> is the ID of a valid certificate reported by the GET /api/v1/trust-management/certificates NS
    X API. Once the expiring certificate is no longer in use, it should be deleted by invoking the DELETE /api/v1/trust-management/certificates/5c9565d8-2cfa-4a28-86cc-e095acba5ba2 NSX API.",
    "node_id" : "37e90542-f8b8-136e-59bc-5dd3b79b122b",
    "node_resource_type" : "ClusterNodeConfig",
    "entity_id" : "5c9565d8-2cfa-4a28-86cc-e095acba5ba2",
    "last_reported_time" : 1637510695463,
    "status" : "OPEN",
    "severity" : "MEDIUM",
    "node_display_name" : "nsx",
    "node_ip_addresses" : [ "" ],
    "reoccurrences_while_suppressed" : 0,
    "entity_resource_type" : "certificate_self_signed",
    "alarm_source_type" : "ENTITY_ID",
    "alarm_source" : [ "5c9565d8-2cfa-4a28-86cc-e095acba5ba2" ],
    "resource_type" : "Alarm",
    "display_name" : "3e79618a-c89e-477b-8872-f4c87120585b",
    "_create_user" : "system",
    "_create_time" : 1635035211215,
    "_last_modified_user" : "system",
    "_last_modified_time" : 1637510695464,
    "_system_owned" : false,
    "_protection" : "NOT_PROTECTED",
    "_revision" : 353
    Now, saving cURL commands by hand can be very administratively intensive, so I recommend some method of saving and automating custom API calls. Quite a few of the more complex calls will require JSON payloads, variables, and the like.
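One way to save calls, sketched below, is a JSON document of named call definitions with placeholders that get filled in at run time. The format and the names `SAVED_CALLS` and `render_call` are assumptions made for this illustration, not the settings format of any particular REST client. The two URIs are the ones quoted in the alarm output above.

```python
import json
from string import Template

# Named API calls stored as data; "$id" is filled in per invocation
SAVED_CALLS = json.loads("""
{
  "list_certificates": {
    "method": "GET",
    "uri": "/api/v1/trust-management/certificates"
  },
  "delete_certificate": {
    "method": "DELETE",
    "uri": "/api/v1/trust-management/certificates/$id"
  }
}
""")


def render_call(name, **variables):
    """Look up a saved call and substitute its variables."""
    call = SAVED_CALLS[name]
    return {
        "method": call["method"],
        # substitute() raises KeyError if a placeholder is left unfilled
        "uri": Template(call["uri"]).substitute(**variables),
    }


rendered = render_call(
    "delete_certificate", id="5c9565d8-2cfa-4a28-86cc-e095acba5ba2"
)
print(rendered)
```

Because `substitute` refuses to leave a placeholder unfilled, a forgotten variable fails loudly instead of sending a malformed request.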

    Executing a Procedure

    Planning the Procedure

    Here we'll use the API to resolve the following alarm. I'm going to use my own REST client, found here, because it's familiar. Let's write the desired result in pseudo-code first to develop a plan:

    • GET current cluster certificate ID
    • GET certificate store
    • PUT a replacement certificate with a new name
    • GET certificate store (validate PUT)
    • GET certificate ID (to further validate PUT). For idempotency, multiple runs should be supported.
    • POST update cluster certificate
    • GET current cluster certificate ID
    This process seems tedious, but computers don't ever get bored, and the objective here is to be more thorough than is reasonably feasible with manual execution! If you're thinking, "Gee, this is an awful lot of work!" trick rocks into doing it for you.
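Before pointing anything at a real manager, the plan above can be rehearsed against a stand-in. The sketch below uses a hypothetical in-memory class (`FakeCertificateApi`) in place of the NSX-T API, so every name in it is illustrative. Unlike the script later in this post, it raises an error when no new certificate is detected instead of prompting the operator.

```python
class FakeCertificateApi:
    """In-memory stand-in for a certificate API (hypothetical, for illustration)."""

    def __init__(self):
        self.store = {"cert-old": "expiring"}
        self.cluster_certificate = "cert-old"
        self._counter = 0

    def list_certificates(self):
        return sorted(self.store)

    def upload_certificate(self, pem):
        self._counter += 1
        self.store[f"cert-new-{self._counter}"] = pem

    def apply_cluster_certificate(self, cert_id):
        if cert_id not in self.store:
            raise KeyError(cert_id)
        self.cluster_certificate = cert_id


def replace_cluster_certificate(api, pem):
    """Walk the plan step by step, validating after every change."""
    result = {"old_id": api.cluster_certificate}  # GET current cluster certificate ID
    before = set(api.list_certificates())  # GET certificate store
    api.upload_certificate(pem)  # PUT a replacement certificate
    new_ids = sorted(set(api.list_certificates()) - before)  # GET store (validate PUT)
    if len(new_ids) != 1:
        raise RuntimeError(f"Expected exactly one new certificate, found {new_ids}")
    api.apply_cluster_certificate(new_ids[0])  # POST update cluster certificate
    result["new_id"] = api.cluster_certificate  # GET current cluster certificate ID
    if result["new_id"] != new_ids[0]:
        raise RuntimeError("Cluster certificate did not change")
    return result


api = FakeCertificateApi()
outcome = replace_cluster_certificate(api, "placeholder PEM text")
print(outcome)
```

Every mutating step is followed by a read that confirms it, which is the whole point of the tedious-looking plan.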

    Let's Trick Those Rocks

    Some general guidelines when scripting API calls:
    • Use a familiar language. An infrastructure engineer's goal with automation is reliability. Hiring trends, hipster cred, don't matter here. If you do best with a slide rule, use that.
    • Use libraries. An infrastructure engineer's goal with automation is reliability. Leverage libraries with publicly available testing results.
    • Log and Report: An infrastructure engineer's goal with automation is reliability. Report every little thing your code does to your infrastructure, and test code thoroughly.
    In this case, I published a wrapper for Python requests that allows me to save API settings here, and built a script on that library. Install it first:
    python3 -m pip install restify-ENGYAK

    From here, it's important to research the API calls required for this procedure (good thing we have the steps!). For NSX-T, the API Documentation is available here:

    NSX-T's Certificate management API also has a couple of quirks, where the Web UI and the API leverage different certificates. It's outlined here:

    Since we're writing code for reliability

    I'd like to outline a rough idea of where my time investment was for this procedure. I hope it helps because the focus really isn't on writing code.
    • 50%: Testing and planning testing. I used Jenkins CI for this, and I'm not the most capable with it. This effort reduces over time, but its importance does not! Write your test cases before everything!
    • 30%: Research. Consulting the VMware API docs and official documentation was worth every yoctosecond - avoiding potential problems with planned work is critical (and there were some major caveats with the API implementation).
    • 10%: Updating the parent library, setting up the python environment. Most of this work is 100% re-usable.
    • 5%: Managing source code, Git branching, basically generating a bread-crumb trail for the implementation for when I don't remember it.
    • 5%: Actually writing code!
    I'm saving useful API examples in my public repository:

    The Code

    # JSON Parsing tool
    import json
    # Import Restify Library
    from restify.RuminatingCogitation import Reliquary
    # Import OS - let's use this for passwords and usernames
    # APIUSER = Username
    # APIPASS = Password
    import os
    api_user = os.getenv("APIUSER")
    api_pass = os.getenv("APIPASS")
    # Set the interface - apply from variables no matter what
    cogitation_interface = Reliquary(
        "settings.json", input_user=api_user, input_pass=api_pass
    )
    # Build Results Dictionary
    stack = {
        "old_cluster_certificate_id": False,
        "old_certificate_list": [],
        "upload_result": False,
        "new_certificate_id": False,
        "new_certificate_list": [],
        "new_cluster_certificate_id": False,
    }
    # GET current cluster certificate ID
    stack["old_cluster_certificate_id"] = json.loads(
    # GET certificate store
    for i in json.loads(cogitation_interface.namshub("get_cluster_certificates"))[
    # We need to compare lists, so let's sort it first
    # PUT a replacement certificate with a new name
    print(cogitation_interface.namshub("put_certificate", namshub_variables="cert.json"))
    # GET certificate store (validate PUT)
    for i in json.loads(cogitation_interface.namshub("get_cluster_certificates"))[
    # We need to compare lists, so let's sort it first, then make it the difference between new and old
    stack["new_certificate_list"] = list(
        set(stack["new_certificate_list"]) - set(stack["old_certificate_list"])
    )
    # Be Idempotent - this may be run multiple times, and should handle it accordingly.
    if len(stack["new_certificate_list"]) == 0:
        stack["new_certificate_id"] = input(
            "Change not detected! Please select a certificate to replace with: "
        )
    else:
        stack["new_certificate_id"] = stack["new_certificate_list"][0]
    # GET certificate ID (to further validate PUT)
            namshub_variables=json.dumps({"id": stack["new_certificate_id"]}),
    # POST update cluster certificate
            namshub_variables=json.dumps({"id": stack["new_certificate_id"]}),
            namshub_variables=json.dumps({"id": stack["new_certificate_id"]}),
    # GET current cluster certificate ID
    stack["new_cluster_certificate_id"] = json.loads(
    # Show the results
    print(json.dumps(stack, indent=4))

    Sunday, October 10, 2021

    Get rid of certificate errors with Avi (NSX-ALB) and Hashicorp Vault!

     Have you ever seen this error before?

    This is a really important issue in enterprise infrastructure because unauthenticated TLS connections teach our end users to be complacent and ignore this error. 

    TLS Authentication

    SSL/TLS for internal enterprise administration typically only addresses the confidentiality aspects of an organizational need - yet the integrity aspects are not well realized:

    This is an important aspect of our sense of enterprise security, but the level of effort required to authenticate information endpoints is high for TLS, so we make do with what we have. 

    The practice of ignoring authentication errors for decades has promoted complacency

    Here's another error that enterprise systems administrators see all the time:

    ssh {{ ip }}
    The authenticity of host '{{ ip }} ({{ ip }})' can't be established.
    RSA key fingerprint is SHA256:{{ hash }}.
    Are you sure you want to continue connecting (yes/no)?

    This probably looks familiar too - Secure Shell (SSH) follows a different method of establishing trust, where the user should verify that hash is correct by some method, and if it changes, it'll throw an error that we hopefully don't ignore:

    ssh {{ ip }}
    Someone could be eavesdropping on you right now (man-in-the-middle attack)!
    It is also possible that a host key has just been changed.
    The fingerprint for the RSA key sent by the remote host is
    SHA256:{{ hash }}.
    Please contact your system administrator.
    Add correct host key in known_hosts to get rid of this message.
    Offending ECDSA key in known_hosts
    {{ cipher }} host key for {{ ip }} has changed and you have requested strict checking.
    Host key verification failed.

    SSH is performing something very valuable here - authentication. By default, SSH will record a node's host key in a file called known_hosts to ensure that the server is in fact the same as the last time you accessed it. In turn, once the server authenticates, you provide some level of authentication (user, key) afterward to ensure that you are who you say you are too. Always ensure that the service you're giving a secret to (like your password!) is authenticated or validated in some way first!
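The known_hosts behaviour can be sketched in a few lines. This is a simplified model, not how OpenSSH is implemented: the key blobs are made-up byte strings, the store is a dictionary rather than a file, and an unknown host is recorded automatically, which stands in for the user answering "yes" at the prompt. The fingerprint format (unpadded base64 of a SHA-256 digest) matches what the prompts above display.

```python
import base64
import hashlib


def fingerprint(key_blob: bytes) -> str:
    """Render a key the way the SSH prompt shows it: SHA256:<base64 digest>."""
    digest = hashlib.sha256(key_blob).digest()
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def check_host(known_hosts: dict, host: str, key_blob: bytes) -> str:
    """Compare the presented key with what was recorded for this host."""
    seen = known_hosts.get(host)
    if seen is None:
        # First contact: record the key (models the user accepting the prompt)
        known_hosts[host] = key_blob
        return "new host: verify " + fingerprint(key_blob) + " out of band"
    if seen != key_blob:
        return "HOST KEY CHANGED: refuse to connect"
    return "host key matches"


known_hosts = {}
first = check_host(known_hosts, "192.0.2.1", b"example-key-blob")
second = check_host(known_hosts, "192.0.2.1", b"example-key-blob")
third = check_host(known_hosts, "192.0.2.1", b"different-key-blob")
print(first, second, third, sep="\n")
```

Note that the first contact is the weak point: nothing in the model can tell a genuine key from an impostor's until someone verifies the fingerprint by another channel.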

    Web of Trust versus Centralized Identity

    Web-of-Trust (WoT)

    Web-of-Trust (WoT) is typically the easiest form of authentication scheme to start out with, but if executed properly it results in combinatorial scaling issues later on, since every participant has to validate every peer. In this model, it's on the individual to validate identities from each peer they interact with since WoT neither requires nor wants a centralized authority to validate against.

    Typically enterprises use WoT because it's baked into a product, not specifically due to any particular need. Certain applications work well with it - so generally you should:
    • Keep your circle small
    • Replace crypto regularly
    • Leverage multiple identities for multiple tasks
      • e.g. separate your code signing keys from your SSH authentication keys
    Advantages:
    • Easy initial set-up
    • Doesn't depend on a third party to establish infrastructure
    Disadvantages:
    • The user is empowered to make both good and bad decisions, and the vast majority of users don't care enough about security to maintain vigilance
    • If you're in an organization with hundreds of "things to validate", you have to personally validate a lot of keys
    • It's a lot of work to properly validate - Ex. You probably don't ask for a person's ID every time you share emails with them
    • Revocation: If a key is compromised, you're relying on every single user to revoke it (or renew it, change your crypto folks) in a timely manner. This is a lot of work depending on how much a key is used.
    Examples: SSH, PGP
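One way to put a number on the "lots of keys to validate" problem: if every participant validates every other participant directly, the number of distinct pairs is n(n-1)/2. This is a back-of-the-envelope sketch that counts direct pairwise checks only and ignores re-validation after key rotation.

```python
def pairwise_validations(participants: int) -> int:
    """Distinct pairs that must validate each other in a full mesh of trust."""
    return participants * (participants - 1) // 2


for n in (5, 50, 500):
    print(n, pairwise_validations(n))
```

With a central authority, by contrast, each participant validates one thing: the authority.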

    Centralized Identity

    Centralized Identity services are the sweetheart of large enterprises. Put your security officers in charge of one of these babies and they'll make it sing.

    In this model, it's on the Identity Administrator to ensure the integrity of any Identity Store. They'll typically do quite a bit better than your average WoT user because it's their job to do so.

    Centralized Identity services handle routine changes like ID refreshes/revocations much more easily with dedicated staffing - mostly because the application and maintainer are easy to define. But here's the rub, you have to be able to afford it. Most of the products that fit in this category are not free and require at least part-time supervision by a capable administrator.

    It's not impossible, though. One can build centralized authentication mechanisms with open source tooling, it just takes work. If you aren't the person doing this work, you should help them by being a vigilant user - if an identity was compromised, report it quickly, even if it was your fault - the time to respond here is vital. Try to shoulder some of this weight whenever you can - it's an uphill hike for the people doing it and every little contribution counts.

    Back to TLS and Certificates

    In the case of an Application Delivery Administrator, an individual is responsible for the integrity and confidentiality of the services they deliver. This role must work hand-in-glove with Identity administrators in principle, and both are security administrators at heart.

    This is really just a flowery way to say "get used to renewing and filing Certificate Signing Requests (CSRs)".

    In an ideal world, an Application Delivery Controller (ADC) will validate the integrity of a backend server (Real Server) before passing traffic to it, in addition to providing the whole "CIA Triad" to clients. Availability is an ADC's thing, after all.

    Realistically an ADC Administrator will only control one of these two legs - and it's plenty on its own. Here's one way to execute this model.

    Certificate Management

    Enough theory, let's do some things. First, we'll build a PKI inside of HashiCorp Vault - this assumes a full Vault installation. Here's a view of the planned Certificate Hierarchy:

    From the HashiCorp Vault GUI - let's set up a PKI secrets engine for the root CA:

    Note: Default duration is about a month (Vault's default lease TTL is 768 hours, or 32 days), so I've overridden this by setting the default and max-lifetime under each CA, labeled as "TTL".
    Let's create the services and user CAs:

    This will provide a CSR - we need to sign it under the root CA:

    Copy the resulting certificate into your clipboard - these secrets engines are autonomous, and don't interoperate - so we'll have to install it into the intermediate CA.
    We install the certificate via the "Set signed intermediate" button in Vault:
    Now, we have a hierarchical CA!
    NB: You will need to create a Vault "role" here - 
    Mega NB: The root CA should nominally be "offline" and at a minimum part of a separate Vault instance!

    For this post, we'll just be issuing certificates manually. We need to extract the intermediate and root certificates to install in NSX ALB and participating clients. These can be pulled from the root-ca module:
    Note: Vault doesn't come with a certificate reader as of 1.8.3. You can read these certificates with online tools or by performing the following command with OpenSSL:
    openssl x509 -in cert1.crt -noout -text
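    Before trusting a CA certificate anywhere, it's worth comparing fingerprints between the copy you exported and the copy you're about to import. Here's a minimal sketch using only the Python standard library. The function name is hypothetical, and the demo input is placeholder bytes wrapped in PEM armor rather than a real certificate. The result is colon-separated uppercase hex, the same style OpenSSL uses for fingerprints:

```python
import hashlib
import ssl


def pem_sha256_fingerprint(pem_text: str) -> str:
    """Return the SHA-256 fingerprint of a single PEM-encoded certificate."""
    # Strip the PEM armor and base64-decode to the raw DER bytes.
    der = ssl.PEM_cert_to_DER_cert(pem_text.strip())
    digest = hashlib.sha256(der).hexdigest().upper()
    # Group into colon-separated byte pairs, e.g. "AB:CD:...".
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


# Placeholder bytes, NOT a real certificate. In practice, read the text
# of the exported .crt file instead.
demo_pem = ssl.DER_cert_to_PEM_cert(b"placeholder bytes, not a certificate")
print(pem_sha256_fingerprint(demo_pem))
```

    This only hashes the encoded bytes; it doesn't parse or validate the certificate, so it complements the OpenSSL command above rather than replacing it.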

    Once we have the files, let's upload them to Avi:
    For each certificate, click "Root/Intermediate CA Certificate" and Import. Note that you do need to click on Validate before importing.

    Now that we have the CA available, we should start by authenticating Avi itself and creating a controller certificate:
    Fulfilling the role of PKI Administrator, let's sign the CSR after verifying authenticity.
    Back to the role of Application Administrator! We've received the certificate, let's install it in the Avi GUI!
    Once we've verified the certificate is healthy, let's apply it to the management plane under Administration -> Settings -> Access Settings:
    At this point, we'll need to trust the root certificate created in Vault - else we'll still see certificate errors. Once that's done, we'll be bidirectionally authenticated with the Avi controller!

    From here on out - we'll be able to leverage the same process, in short:
    • Under Avi -> Templates -> Security -> TLS/SSL Certificates, create a new Application CSR
      • Ensure that all appropriate Subject Alternative Names (SANs) are captured!
    • Under Vault -> svc-ca -> issued-certificates -> Sign Certificate
    • Copy issued certificate to TLS Certificate created in the previous step
    • Assign to a virtual service. Unlike F5 LTM, this is decoupled from the clientssl profile.
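    Missing SANs are the easiest way to botch this process, so here's a minimal sketch of a pre-flight check to run before filing the CSR. The function names and hostnames are hypothetical, and it assumes the common rule that a wildcard covers exactly one left-most label:

```python
def san_matches(hostname: str, san: str) -> bool:
    """Return True if one DNS SAN entry covers the hostname.

    A leading "*." is treated as matching exactly one left-most label.
    """
    hostname = hostname.lower().rstrip(".")
    san = san.lower().rstrip(".")
    if san.startswith("*."):
        host_labels = hostname.split(".")
        return len(host_labels) > 1 and ".".join(host_labels[1:]) == san[2:]
    return hostname == san


def uncovered_hostnames(hostnames, sans):
    """List the hostnames that no SAN entry covers."""
    return [h for h in hostnames if not any(san_matches(h, s) for s in sans)]


# Hypothetical names for illustration.
sans = ["app.example.com", "*.lab.example.com"]
wanted = ["app.example.com", "avi.lab.example.com", "lab.example.com"]
print(uncovered_hostnames(wanted, sans))  # ['lab.example.com']
```

    The wildcard doesn't cover the bare lab.example.com name, which is exactly the kind of thing that's cheaper to catch before the certificate is signed.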

    Sunday, September 19, 2021

    Get an A on with VMware Avi / NSX ALB (and keep it that way with SemVer!)

    Cryptographic security is an important aspect of hosting any business-critical service.

    When hosting a public service secured by TLS, it is important to strike a balance between compatibility (The Availability aspect of CIA), and strong cryptography (the Integrity or Authentication and Confidentiality aspects of CIA). To illustrate, let's look at the CIA model:

    In this case, we need to balance backward compatibility with using good quality cryptography - here's a brief and probably soon-to-be-dated overview of what we ought to use and why.


    This block is fairly easy, as older protocols are worse, right? 

    TLS 1.3

    As a protocol, TLS 1.3 has quite a few great improvements and is fundamentally simpler to manage with fewer knobs and dials. There is a major concern with TLS 1.3 currently - security tooling in the large enterprise hasn't caught up with this protocol yet, as new ciphers like ChaCha20 don't have hardware-assisted lanes for decryption. Here are some of the new capabilities you'll like:
    • Simplified Crypto sets: TLS 1.3 deprecates a ton of less-than-secure crypto - TLS 1.2 supports up to 356 cipher suites, 37 of which are new with TLS 1.2. This is a mess - TLS 1.3 supports five.
      • Note: The designers for TLS 1.3 achieved this by removing forward secrecy methods from the cipher suite, and they must be separately selected.
    • Simplified handshake: TLS 1.3 connections require fewer round-trips, and session resumption features allow a 0-RTT handshake.
    • AEAD Support: AEAD ciphers provide both integrity and confidentiality. AES Galois Counter Mode (GCM) and ChaCha20-Poly1305 (designed by Daniel J. Bernstein and championed by Google) serve this purpose.
    • Forward Secrecy: If a cipher suite doesn't have PFS (I disagree with perfect) support, it means that anyone who captures your network traffic can decrypt it later if the private keys are ever acquired. PFS support is mandatory in TLS 1.3.

    Here are some of the things you can do to mitigate the risk if you're in a large enterprise that performs decryption:
    • Use a load balancer - since this is about a load balancer, you can protect your customers' traffic in transit by performing SSL/TLS bridging. Set the LB-to-Server (serverssl) profile to a high-efficiency cipher suite (TLS 1.2 + AES-CBC) to stay compatible with that tooling while still protecting privacy.

    TLS 1.2

    TLS 1.2 is like the Toyota Corolla of TLS: it's been running forever and not everyone maintains it properly.

    It can still perform well if properly configured and maintained - we'll go into more detail on how in the next section. The practices outlined here are good for all editions of TLS.

    Generally, TLS 1.0 and 1.1 should not be used. Two OS families (Windows XP, and Android 4 and below) were disturbingly slow to adopt TLS 1.2, so if this is part of your customer base, beware.
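    The same protocol floor is easy to express on the client side, which is handy for test harnesses and monitoring scripts. This is a minimal sketch using Python's standard ssl module; it's a client context for checking a service, not Avi configuration:

```python
import ssl

# Client-side context with certificate and hostname verification enabled.
context = ssl.create_default_context()

# Refuse to negotiate anything older than TLS 1.2.
context.minimum_version = ssl.TLSVersion.TLSv1_2

print(context.minimum_version.name, context.verify_mode.name)
```

    A client built this way will fail the handshake against a service that only offers TLS 1.0 or 1.1, which makes it a cheap regression check.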


    This information is much more likely to be dated. I'll try to keep this short:


    • (AEAD) AES-GCM: This is usually my all-around cipher. It's decently fast and supports partial acceleration with hardware ADCs / CPUs. AES is generally pretty fast, so it's a good balance of performance and confidentiality. I don't personally think it's worth running anything but 256-bit on modern hardware.
    • (AEAD) ChaCha20: This was designed by Daniel J. Bernstein and brought into TLS largely through Google's efforts, and is still "being proven". Generally trusted by the public, this newer cipher is fast despite a lack of hardware acceleration.
    • AES-CBC: This was the "advanced" cipher for confidentiality before AES-GCM. Standardized in 2001, AES is highly performant and motivated users to move from ciphers like DES and RC4 by being both more performant and stronger. Like with AES-GCM, I prefer not to use anything but 256-bit on modern hardware.
    • Everything else: This is the "don't bother" bucket: RC4, DES, 3DES


    Generally, AEAD provides an advantage here - SHA3 isn't generally available yet but SHA2 variants should be the only thing used. The more bits the better!

    Forward Secrecy

    • ECDHE (Elliptic Curve Diffie-Hellman Ephemeral): This should be mandatory with TLS 1.2 unless you have customers with old Android phones and Windows XP.
    • TLS 1.3 lets you select multiple PFS algorithms that are EC-based.
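    To see what an "ECDHE and AEAD only" policy expands to, here's a minimal sketch using Python's ssl module, which accepts OpenSSL-style cipher strings. The string below is an illustrative assumption, not the profile used later in this post, and the exact suites listed depend on the local OpenSSL build:

```python
import ssl

# Illustrative OpenSSL-style string: ECDHE key exchange with AEAD ciphers only.
CIPHER_STRING = "ECDHE+AESGCM:ECDHE+CHACHA20"

context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
context.set_ciphers(CIPHER_STRING)

# TLS 1.3 suites (names starting with "TLS_") are not controlled by
# set_ciphers(), so they may be listed regardless of the string above.
enabled = [suite["name"] for suite in context.get_ciphers()]
for name in enabled:
    print(name)
```

    Anything in the TLS 1.2 portion of the output that doesn't start with ECDHE would mean the string is looser than intended.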

    Matters of Practice

    Before we move into the Avi-specific configuration, I have a recommendation that is true for all platforms:
    Cryptography practices change over time - and some of these changes break compatibility. Semantic versioning provides the capability to support three scales of change:
    • Major Changes: First number in a version. Since the specification is focused on APIs, I'll be more clear here. This is what you'd iterate if you are removing cipher suites or negotiation parameters that might break existing clients
    • Minor Changes: This category would be for tuning and adding support for something new that won't break compatibility. Examples here would be cipher order preference changes or adding new ciphers.
    • Patch Changes: This won't be used much in this case - here's where we'd document a fix that restores the intent of an earlier Minor Change, like correcting a mistake in cipher order preference.
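    The three scales above can be sketched as a small decision helper. This is a minimal illustration, not tooling from Avi or the SemVer project; the function names, the is_correction flag, and the sample profile contents are all assumptions made for the example:

```python
def classify_change(old_ciphers, new_ciphers, is_correction=False):
    """Pick the SemVer level for a TLS profile change.

    old_ciphers / new_ciphers are ordered lists of cipher suite names.
    """
    removed = set(old_ciphers) - set(new_ciphers)
    if removed:
        # Dropping a suite can lock out existing clients.
        return "major"
    if is_correction:
        # Fixing a mistake made in an earlier minor change.
        return "patch"
    # Additions or preference re-ordering keep existing clients working.
    return "minor"


def bump(version, level):
    """Return the next version string for the given level."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


# Hypothetical profile contents for illustration.
v1 = ["ECDHE-RSA-AES256-GCM-SHA384", "ECDHE-RSA-AES256-SHA384"]
v2 = ["ECDHE-RSA-AES256-GCM-SHA384"]
print(bump("1.0.0", classify_change(v1, v2)))  # 2.0.0
```

    Removing the CBC suite is a breaking change for any client that relied on it, so the profile goes from 1.0.0 to 2.0.0 rather than 1.1.0.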

    Let's do it!

    Let's move into an example leveraging NSX ALB (Avi Vantage). Here, I'll be creating a "first version," but the practices are the same. First, navigate to Templates -> Security -> SSL/TLS Profile:

    Note: I really like this about Avi Vantage, even if I'm not using it here. The security scores here are accurate, albeit capped out - VMware is probably doing this to encourage use of AEAD ciphers:
    ...but, I'm somewhat old-school. I like using Apache-style cipher strings because they can apply to anything, and everything will run TLS eventually. Here are the cipher strings I'm using - the first is TLS 1.2, the second is TLS 1.3.

    One gripe I have here is that Avi doesn't offer a "What If" analysis like F5's TMOS does (14+ only). Conversely, applying this profile is much easier. To do this, open the virtual service, and navigate to the bottom right:

    That's it! Later on, we'll provide examples of coverage reporting for these profiles. In a production-like deployment, these services should be managed with release strategies given that versioning is applied.

    Friday, September 17, 2021

    Static IPv4/IPv6 Addresses - Debian 11

     Here's how to set both static IPv4 and IPv6 addressing on Debian 11. The new portions are outlined in italics.

    First, edit /etc/network/interfaces

    # Loopback interface
    auto lo
    iface lo inet loopback

    # The primary network interface
    auto ens192
    allow-hotplug ens192
    # Static IPv4 addressing
    iface ens192 inet static
        address {{ ipv4.address }}
        gateway {{ ipv4.gateway }}
    # Static IPv6 addressing
    iface ens192 inet6 static
        address {{ ipv6.address }}
        gateway {{ ipv6.gateway }}

    Then, restart your networking stack:
    systemctl restart networking
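    A typo in a gateway is a quick way to lock yourself out of a remote box, so it can be worth sanity-checking the values before restarting. Here's a minimal sketch with Python's ipaddress module. It assumes the address values carry a prefix length (for example /24 or /64), and the function name and documentation-range addresses are stand-ins for the template variables above:

```python
import ipaddress


def gateway_is_on_link(address_cidr: str, gateway: str) -> bool:
    """Return True if the gateway is reachable on the interface's own link."""
    interface = ipaddress.ip_interface(address_cidr)
    gateway_ip = ipaddress.ip_address(gateway)
    if gateway_ip.version != interface.version:
        return False
    # IPv6 gateways are often link-local (fe80::/10), which is also on-link.
    if gateway_ip.version == 6 and gateway_ip.is_link_local:
        return True
    return gateway_ip in interface.network


# Documentation-range addresses, stand-ins for the template variables.
print(gateway_is_on_link("192.0.2.10/24", "192.0.2.1"))      # True
print(gateway_is_on_link("2001:db8::10/64", "2001:db8::1"))  # True
print(gateway_is_on_link("192.0.2.10/24", "198.51.100.1"))   # False
```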

    Friday, September 10, 2021

    VMware NSX ALB (Avi Networks) and NSX-T Integration, Installation

    Note: I created a common baseline for pre-requisites in this previous post. We'll be following VMware's Avi + NSX-T Design guide.

    This will be a complete re-install. Avi Vantage appears to develop some tight coupling issues when the same vCenter is used for both Layer 2 and NSX-T deployments - which is not an issue that most people will typically have. Let's start with the OVA deployment:

    Initial setup here will be very different compared to a typical vCenter standalone or read-only deployment. The setup wizard should be very minimally followed:

    With a more "standard" deployment methodology, the Avi Service Engines will be running on their own Tier-1 router, and leveraging Source-NAT (misnomer, since it's a TCP proxy) for "one-arm load balancing":

    To perform this, we'll need to add two segments to the ALB Tier-1: one for management, and one for vIPs. I have created the following NSX-T segments, one running DHCP and one for vIPs:
    Note: I used underscores in this segment name; in my own testing, both . and / are illegal characters. Avi's NSX-T Cloud Connector will report "No Transport Nodes Found" if it cannot match the segment name due to these characters.
    Note: If you configure an NSX-T cloud and discover this issue, you will need to delete and re-add the cloud after fixing the names!
    Note: IPv6 is being used, but I will not share my globally routable prefixes.
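    Since fixing a bad segment name means deleting and re-adding the cloud, it's cheaper to check names up front. Here's a minimal sketch based only on the two characters I ran into above; the function names and segment names are hypothetical, and there may well be other characters the connector dislikes:

```python
# Characters observed to break segment matching in my own testing.
ILLEGAL_CHARACTERS = "./"


def illegal_characters_in(name: str) -> list:
    """Return the problem characters found in a segment name, if any."""
    return sorted(set(name) & set(ILLEGAL_CHARACTERS))


def suggest_name(name: str) -> str:
    """Swap each problem character for an underscore."""
    return name.translate(str.maketrans(ILLEGAL_CHARACTERS, "__"))


# Hypothetical segment names for illustration.
for candidate in ["avi_mgmt_10_7_80_0", "avi-vip-10.7.81.0/24"]:
    print(candidate, illegal_characters_in(candidate), suggest_name(candidate))
```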

    First off, let's create NSX-T Manager and vCenter Credentials:
    There is one thing that needs to be created on vCenter as well - a content library. Just create a blank one and label it accordingly, then proceed with the following steps:
    Click Save, and get ready to wait. The Avi controller has automated quite a few steps, and it will take a while to run. If you want to track any issues in NSX ALB, navigate to Operations -> Events -> Show Internal:
    Once the NSX Cloud is reporting as "Complete" under Infrastructure -> Dashboard, we need to specify some additional data to ensure that the service engines will deploy. To do this, we navigate to Infrastructure -> Cloud Resources -> Service Engine Groups, and select the Cloud:
    Then let's build a Service Engine Group. This will be the compute resource attached to our vIPs. Here I configured a naming convention and a compute target - and it can automatically drop SEs into a specific folder.
    The next step here is to configure the built-in IPAM. Let's add an IP range under Infrastructure -> Cloud Resources -> Networks by editing the appropriate network ID. Note that you will need to select the NSX-T cloud to see the correct network:
    Those of you who have been LTM Admins will appreciate this. Avi SEs also perform "Auto Last Hop," so you can reach a vIP without a default route, but monitors (health checks) will fail. The spot to configure the custom routes is under Infrastructure -> Cloud Resources -> Routing:

    Finally, let's verify that the NSX-T Cloud is fully configured. An interesting thing I saw here is that Avi 21 shows an unconfigured or "In Progress" cloud as green now, so we'll have to mouse over the cloud status to check in on it. 
    Even with everything configured (at least in terms of infrastructure), Avi will not deploy Service Engines until there's something to do! So let's give it something:
    Let's define a pool (back-end server resources):

    Let's set an HTTP-to-HTTPS redirect as well:

    Finally, let's make sure that the correct SE group is selected:
    And that's it! You're up and running with Avi Vantage 21! After a few minutes, you should see deployed service engines:
    The service I configured is also now up - in this case, I'm using Hyperglass, and I can leverage the load-balanced vIP to check and see what the route advertisement from Avi looks like. As you can see, it's advertising a multipath BGP host route:
