Sunday, December 26, 2021

The winds of change in cloud operations, and why integrations like NSX Data Center 3.2 + Advanced Load Balancer are important

The Jetstreams

Cloud operators now provide two completely different classes of service to customers:

  • Self-Service, VMs, Operating System Templates
    • Generally mature; some private cloud operators are still smoothing out CMPs and the like, but these services work as intended from a customer perspective
    • Bringup is automated
    • Operating System level configuration is usually automated
    • Application-level configuration is not always automated or managed as code
    • Cloud Provider typically will hold responsibility for a widget working
  • Containers, Service Definitions, no GUI
    • Kubernetes fits squarely here, but other services exist
    • Not the most customer-friendly; still nascent
    • Application Owner has to hold responsibility for a widget working

Infrastructure Engineers as Agents of Change

The industry cannot transition responsibility for "stuff working" to creative types (Web Developers, App Developers). Have you ever heard "it's the network"? How about "This must be a <issue I'm not responsible for>"?

This is a call for help. Once the current trends with Automation and reliability engineering slow down (Type 1 above), the second kind of automation is going to necessitate leveraging infrastructure expertise elsewhere. Services like Kubernetes require a "distribution" of sorts, but there's nobody to blame when something fails to deploy.

Why This Matters

NSX-T's 3.2 release has provided a ton of goodies, with an emphasis on centralized management and provisioning. We're starting to see tools that will potentially support multiple inbound declarative interfaces to achieve similar types of work, and NSX Data Center Manager has all the right moving parts to provide that.

NSX ALB's Controller interface provides comprehensive self-service and troubleshooting information, and a "Lite" service portal.

NSX Data Center + ALB presents a unique value set, with one provisioning point for all services, in addition to the previously provided Layer 3 fabric integration. It's good to see this type of write-many implementation.

Let's try it out!

First, let's cover some prerequisites (detailed list here: https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.2/installation/GUID-3D1F107D-15C0-423B-8E79-68498B757779.html):

  • Greenfield Deployment only. This doesn't allow you to pick up existing NSX ALB installations
  • NSX ALB version 20.1.7+ or 21.1.2+ OVA
  • NSX Data Center 3.2.0+
  • NSX ALB and NSX Data Center controllers must exist on the same subnet
  • NSX Data Center and NSX ALB clusters should have a vIP configured
  • The integration also doesn't support third-party CAs

Once these are met, the usual NSX ALB deployment prerequisites still apply.
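Since the same-subnet requirement is easy to miss, it's worth sanity-checking the addressing plan before uploading anything. Here's a minimal Python sketch; all of the addresses and the prefix length are lab assumptions:

# Pre-check for the "same subnet" prerequisite above: confirm the planned
# NSX ALB controller addresses share a subnet with the NSX Manager nodes.
# All addresses and the prefix length below are lab assumptions.
import ipaddress

nsx_managers = ["10.0.10.11", "10.0.10.12", "10.0.10.13"]
alb_controllers = ["10.0.10.21", "10.0.10.22", "10.0.10.23"]
management_subnet = ipaddress.ip_network("10.0.10.0/24")

for ip in nsx_managers + alb_controllers:
    if ipaddress.ip_address(ip) not in management_subnet:
        raise SystemExit(f"{ip} is outside {management_subnet} - fix this before deploying")
print("All manager and controller addresses share the management subnet.")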

Deployment is extremely straightforward and is managed under System -> Appliances. This wizard will require you to upload the OVA, so get that rolling before filling out any forms:


The NSX Manager will take care of the VM deployment for you. Interestingly enough, this means we could potentially get rid of tools like pyVmomi and deploy everything with Ansible/Terraform someday.
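To illustrate the direction, here's a rough Python sketch of driving that deployment through the NSX Manager REST API with nothing but requests. The endpoint path and payload fields are placeholders, not the documented API - consult the NSX-T 3.2 API guide for the real ones:

# Hypothetical sketch only: asking NSX Manager to deploy an ALB controller
# node over its REST API instead of driving vCenter with pyVmomi. The URL
# path and payload fields are PLACEHOLDERS, not the documented NSX API.
import requests

NSX_MANAGER = "https://nsx.lab.example"      # assumption: lab NSX Manager
session = requests.Session()
session.auth = ("admin", "changeme")         # assumption: local credentials
session.verify = False                       # lab only

deployment_intent = {
    "display_name": "alb-controller-01",
    "form_factor": "SMALL",
}
response = session.post(
    NSX_MANAGER + "/policy/api/v1/<alb-controller-deployment-endpoint>",
    json=deployment_intent,
)
response.raise_for_status()
print(response.json())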

Once it's done deploying the first appliance, it'll report a "Degraded" state until 3 controllers are deployed.

Once installed, the NSX ALB objects should appear under Networking -> Advanced Load Balancer:

At this point, NSX Data Center -> NSX ALB is integrated, but not ALB -> NSX Data Center. The next step is to configure an NSX-T cloud. I've covered the procedure for configuring an NSX-T cloud here: https://blog.engyak.co/2021/09/vmware-nsx-alb-avi-networks-and-nsx-t.html

Note: Using a non-default CA certificate for NSX ALB here will break the integration. It can be recovered by reverting to the original certificate, and there doesn't appear to be an obvious way to change it yet. A Principal Identity is created for the connection between the two systems, indicating that the feature just isn't fully exposed to users yet.

Viewing Services

A cursory review of the new ALB section indicates that vIPs that already exist on the ALB controller don't appear here, but the inverse is true: objects created here do show up in the ALB GUI. Let's try and build one for Jenkins! The constructs are essentially the same as in the ALB UI, but the process is considerably simpler:
 
First, create a vIP
Then, create the pool:

Finally, we will create the virtual service. Note: nullPointerException seems to mean that the SE Group is incorrect, and may need to be manually resolved on the ALB controller.
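For the declaratively minded, the same three objects can be expressed as intent and pushed over the API. The endpoint paths and field names below are placeholders meant to show the shape of the relationship (vIP -> pool -> virtual service), not the documented schema:

# Illustrative sketch of the vIP -> pool -> virtual service chain built
# above, expressed as declarative intent. Endpoint paths and field names
# are placeholders/assumptions, not the documented NSX policy schema.
import requests

NSX = "https://nsx.lab.example/policy/api/v1"    # assumption: lab NSX Manager
session = requests.Session()
session.auth = ("admin", "changeme")             # assumption: local credentials
session.verify = False                           # lab only

objects = {
    "<alb-vip-endpoint>/jenkins-vip": {"ip_address": "10.0.20.50"},
    "<alb-pool-endpoint>/jenkins-pool": {
        "servers": [{"ip": "10.0.20.101", "port": 8080}],
    },
    "<alb-virtual-service-endpoint>/jenkins-vs": {
        "vip_path": "jenkins-vip",     # references the vIP created first
        "pool_path": "jenkins-pool",   # references the pool created second
        "services": [{"port": 443}],
    },
}

# Order matters: the virtual service references the vIP and pool, so they
# are created first (dicts preserve insertion order in Python 3.7+).
for path, body in objects.items():
    response = session.patch(f"{NSX}/{path}", json=body)
    response.raise_for_status()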

Unlike most VMware products, NSX Data Center seems to handle multi-write (changes from BOTH the ALB and the Manager) fairly well.

That's it!

Footnote: Custom TLS profiles can only be configured via the API. I am building a method to manage that here: https://github.com/ngschmidt/python-restify/tree/main/nsx-t/profiles/tls
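As a rough illustration (the endpoint path is a placeholder; the actual payload examples live in that repository), applying a profile could look something like this:

# Hypothetical sketch: pushing a custom TLS profile over the API, since the
# GUI doesn't expose it. The endpoint path is a placeholder; example payloads
# are in the repository linked above.
import json
import requests

NSX = "https://nsx.lab.example/policy/api/v1"    # assumption: lab NSX Manager

with open("tls_profile.json") as payload_file:   # e.g. a payload from the repo
    profile = json.load(payload_file)

response = requests.patch(
    f"{NSX}/<alb-tls-profile-endpoint>/custom-tls-profile",
    json=profile,
    auth=("admin", "changeme"),   # assumption: local credentials
    verify=False,                 # lab only
)
response.raise_for_status()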


Wednesday, December 22, 2021

NSX-T 3.2 and NSX ALB (Avi) Deployment Error - "Controller is not reachable. {0}"

NSX-T 3.2 has been released, and has a ton of spiffy features. The NSX ALB integration is particularly neat, but while repeatedly (repeatably) breaking the integration to learn more about it, I ran into this error:

When deploying NSX ALB appliances from the NSX Manager, it's very important to keep the NSX ALB Controller appliances where NSX Manager can see them. In addition, the appliances must exist on the same Layer 2 segment as the NSX Manager nodes.

This post is not about the integration, however.

The following error:

NSX Advanced Load Balancer Controller is not reachable {0}

Indicates that NSX-T has orphaned appliances. NSX-T has API invocations for cleaning this up, but no GUI integration. This is similar to other objects, and it makes sense: this kind of cleanup should be done programmatically, with checks, so that the work is reliable.

To fix this, we must perform the following steps:

  • Get the list of NSX ALB appliances; if there aren't any, exit
  • Iterate through the list of appliances, prompting the user to delete
  • After deleting, check to make sure that it was deleted

The first step for any API invocations should be consulting the documentation. The NSX ALB Appliance management section is 3.7.1.4. After researching the procedure, I found the following endpoints:

Performing this procedure with programmatic interfaces is a good example of when to use APIs - the task is well defined, the results are easy to test, and work to prevent user mistakes is rewarding.
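To make that concrete, here's a minimal sketch of the flow described above, using plain requests. The deployment endpoint path is a placeholder - the real endpoints come from the API documentation, and the finished version is linked below:

# Minimal sketch of the cleanup flow described above. The deployment
# endpoint path is a PLACEHOLDER - see the NSX API documentation
# (section 3.7.1.4) and the linked script for the real endpoints.
import requests

NSX = "https://nsx.lab.example/policy/api/v1"    # assumption: lab NSX Manager
session = requests.Session()
session.auth = ("admin", "changeme")             # assumption: local credentials
session.verify = False                           # lab only

ENDPOINT = f"{NSX}/<alb-controller-deployments-endpoint>"

# 1. Get the list of NSX ALB appliances; exit if there aren't any.
deployments = session.get(ENDPOINT).json().get("results", [])
if not deployments:
    raise SystemExit("No NSX ALB appliance deployments found - nothing to do.")

# 2. Iterate through the list, prompting the user before deleting.
for node in deployments:
    node_id = node["id"]
    if input(f"Delete appliance {node_id}? (yes/no) ").lower() == "yes":
        session.delete(f"{ENDPOINT}/{node_id}").raise_for_status()

        # 3. Verify the appliance is actually gone afterwards.
        remaining = [n["id"] for n in session.get(ENDPOINT).json().get("results", [])]
        print(node_id, "deleted" if node_id not in remaining else "still present!")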

TL;DR - I wrote the code here, integrating it with the REST client: https://github.com/ngschmidt/python-restify/blob/main/nsx-t/nsxalb_deployment_cleanup.py

Saturday, December 4, 2021

VyOS and other Linux builds unable to use `vmxnet3` or "VMware Paravirtual SCSI" adapter on vSphere

Have you seen this selector when building machines on vSphere?

This causes some fairly common issues with NOS VMs, as most administrators don't really know which distribution the NOS is based on.

"Guest OS Version" doesn't just categorize your workload, though. Selecting "Other Linux" instructs vSphere to maximize compatibility and ensure the VI admin receives a reliable deployment - which means it'll run some pretty old virtual hardware.

VMware curates its lifecycle "Guest OS" settings here. Note that "Other" isn't described: https://partnerweb.vmware.com/GOSIG/home.html

The two commonly preferred virtual hardware options (VMware Paravirtual SCSI and vmxnet3) aren't available with this particular OS setting. Instead, you get the following, and both cause potential performance issues (a quick check for them is sketched after this list):
  • LSI Logic Virtual SCSI
  • Intel E1000 NIC <---If you're wondering, it will drop your VM's throughput
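Before changing anything, it can help to confirm which virtual devices the VM actually has. A minimal pyVmomi sketch, assuming a lab vCenter and a VM named "vyos-template":

# Sketch: list a VM's virtual devices and flag E1000 NICs and LSI Logic
# SCSI controllers. Hostname, credentials, and VM name are lab assumptions.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab only
si = SmartConnect(host="vcenter.lab.example", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)

view = si.content.viewManager.CreateContainerView(
    si.content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "vyos-template")

for dev in vm.config.hardware.device:
    if isinstance(dev, vim.vm.device.VirtualE1000):
        print("E1000 NIC found:", dev.deviceInfo.label)
    if isinstance(dev, (vim.vm.device.VirtualLsiLogicController,
                        vim.vm.device.VirtualLsiLogicSASController)):
        print("LSI Logic SCSI controller found:", dev.deviceInfo.label)

Disconnect(si)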
Let's cover how we'd fix this in vSphere 7 with a VM. The example in this procedure is VyOS 1.4.

Updating Paravirtualized Hardware


First, let's change the Guest OS version to something more specific. Generally, Linux distributions fall into two categories, Red Hat and Debian derivatives - Gentoo/Arch users won't be covered here because they should be able to find their own way out.

Since we know VyOS is a well-maintained distribution, I'll change it to "Debian 11." While this is technically lying, we're trying to provide a reference hardware version to the virtual machine, not accurately represent the workload. This menu can be reached by selecting "edit VM" on the vSphere console:
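If you'd rather script this step than use the GUI, here's a minimal pyVmomi sketch. The vCenter hostname, credentials, and VM name are lab assumptions, and the guest ID string varies by vSphere release (debian11_64Guest may not exist on older builds; debian10_64Guest is a reasonable fallback):

# Sketch: set the Guest OS type programmatically with pyVmomi. The VM must
# be powered off; names and credentials are lab assumptions.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab only
si = SmartConnect(host="vcenter.lab.example", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)

view = si.content.viewManager.CreateContainerView(
    si.content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "vyos-template")

# "debian11_64Guest" is an assumption - adjust to your vSphere release.
spec = vim.vm.ConfigSpec(guestId="debian11_64Guest")
WaitForTask(vm.ReconfigVM_Task(spec))
print("Guest OS now reports:", vm.config.guestId)

Disconnect(si)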

Second, let's change the SCSI Adapter:

Replacing network adapters will take a little more work. Re-typing existing interfaces is not currently supported in vSphere 7, so we'll need to delete and re-create them. Setting the MAC Address field to static (reusing the old adapter's MAC) lets the guest distribution correlate the new adapter with the same interface. Since I'm lifecycling a VM template, I don't want to do that!

If you're editing an existing VM, make a backup. If it's a NOS, export the configuration. There is no guarantee that the configurations will port over perfectly, and you will want a restore point. Fortunately, lots of options exist in the VMware ecosystem to handle this!
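Here's a minimal pyVmomi sketch of that delete-and-re-add flow, assuming the same lab vCenter and VM names as above; the MAC-pinning lines are the optional part discussed earlier:

# Sketch: remove an E1000 NIC and add a vmxnet3 on the same port group,
# optionally pinning the old MAC so the guest keeps its interface mapping.
# Power the VM off and take a backup/config export before running this.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab only
si = SmartConnect(host="vcenter.lab.example", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
view = si.content.viewManager.CreateContainerView(
    si.content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "vyos-template")

old_nic = next(d for d in vm.config.hardware.device
               if isinstance(d, vim.vm.device.VirtualE1000))

remove = vim.vm.device.VirtualDeviceSpec(
    operation=vim.vm.device.VirtualDeviceSpec.Operation.remove,
    device=old_nic)

new_nic = vim.vm.device.VirtualVmxnet3()
new_nic.backing = old_nic.backing          # reuse the same port group backing
new_nic.addressType = "manual"             # optional: keep the old MAC...
new_nic.macAddress = old_nic.macAddress    # ...skip these two lines for templates
add = vim.vm.device.VirtualDeviceSpec(
    operation=vim.vm.device.VirtualDeviceSpec.Operation.add,
    device=new_nic)

WaitForTask(vm.ReconfigVM_Task(vim.vm.ConfigSpec(deviceChange=[remove, add])))
Disconnect(si)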

Refactoring / Recovering from the change

With my template VM, the only issues were that the interfaces re-numbered and the VRF needed to be re-assigned:

set interfaces ethernet eth2 address dhcp
set interfaces ethernet eth2 vrf mgmt
commit
save

Since we have the VM awake and in non-template-form, we can update the NOS too. (Guide here: https://docs.vyos.io/en/latest/installation/update.html)
vyos@vyos:~$ add system image https://downloads.vyos.io/rolling/current/amd64/vyos-rolling-latest.iso vrf mgmt
Trying to fetch ISO file from https://downloads.vyos.io/rolling/current/amd64/vyos-rolling-latest.iso
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 436M 100 436M 0 0 12.0M 0 0:00:36 0:00:36 --:--:-- 11.7M
ISO download succeeded.
Checking SHA256 (256-bit) checksum...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 106 100 106 0 0 215 0 --:--:-- --:--:-- --:--:-- 215
Found it. Verifying checksum...
SHA256 checksum valid.
Checking for digital signature file...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (22) The requested URL returned error: 404
Unable to fetch digital signature file.
Do you want to continue without signature check? (yes/no) [yes]
Checking MD5 checksums of files on the ISO image...OK.
Done!
What would you like to name this image? [1.4-rolling-202112040649]:
OK. This image will be named: 1.4-rolling-202112040649
Installing "1.4-rolling-202112040649" image.
Copying new release files...
Would you like to save the current configuration
directory and config file? (Yes/No) [Yes]: Yes
Copying current configuration...
Would you like to save the SSH host keys from your
current configuration? (Yes/No) [Yes]: No
Running post-install script...
Setting up grub configuration...
Done.
vyos@vyos:~$ show system image
The system currently has the following image(s) installed:

1: 1.4-rolling-202112040649 (default boot)
2: 1.4-rolling-202103130218 (running image)

vyos@vyos:~$ reboot
Are you sure you want to reboot this system? [y/N] y

Recap

To cover the major points of this article:
  • The "Guest OS" selection in vSphere can mean significant performance improvements or problems depending on what you choose. The selector determines which paravirtualized hardware is offered to a VM, and it errs on the side of compatibility and conservatism.
  • The VM Hardware (compatibility) version is a separate knob entirely; updating it won't make the newer paravirtualized hardware available if the "Guest OS" selection doesn't allow it.
  • Consult your NOS vendor on what to select here, if applicable, and require them to provide documentation on why.
Some additional tangential benefits are present as a result of this change. For example, VM power actions work:

Since we're done, let's check this change into the image library:
Reference: https://blog.engyak.co/2020/10/using-vm-templates-and-nsx-t-for.html
