Wednesday, November 24, 2021

Why Automate? Reliability Approaches with the VMware NSX-T API

Why should an infrastructure engineer leverage REST APIs?

I'm sure most IT workers have at least heard of REST APIs, or heard a sales pitch where a vendor insists that while a requested functionality doesn't exist, you could build it yourself by "using the API".

Or perhaps you've sat through discussions where people tried to hand you a copy of The DevOps Handbook or The Unicorn Project.

They're right, but software development and deployment methods have completely different guiding values than infrastructure management. Speed of delivery is almost completely worthless with infrastructure, where downtime is typically the only metric that infrastructure is evaluated on.

We need to transform the industry.

The industry has proven that "value-added infrastructure" is a thing that people want; otherwise, services like AWS, Azure, and Lumen would not be profitable. Our biggest barrier to success right now is the perception of reliability, because there clearly is demand for what we'd call abstraction of infrastructure. We can't move as slowly as we used to, but we can't make mistakes either.

Stuck between a rock and a hard place?

I have some good news - everybody's just figuring this out as they go, and you don't have to start by replacing all of your day-to-day tasks with Ansible playbooks. Let's use automation tools to ensure Quality First, Speed Second. Machines excel at comparison operators, allowing an infrastructure administrator to test every possible aspect of infrastructure when executing a change. Here are some examples where I've personally seen a need for automation:
  • Large-scale routing changes: if 1,000 routes successfully migrate, and a handful of routes fail, manual checks tend to depend overly (unfairly) on the operator to eyeball the entire lot
    • Check: Before and after routes, export a difference
    • Check: All dynamic routing peers, export a difference
    • Reverse the process if anything fails
  • Certificate renewals
    • Check: If certificate exists
    • Check: If the certificate was uploaded
    • Check: If the certificate has a valid CA chain
    • Check: If the certificate was successfully installed
    • Reverse the process if anything fails
  • Adding a new VLAN or VNI to a fabric
    • Check: VLAN Spanning-Tree topology, export a difference
    • Check: EVPN AFI Peers, export a difference
    • Check: MAC Address Table, export a difference
    • Reverse the process if anything fails
The neat thing about this capability is the configuration reversal - API calls are incredibly easy to process in common programming languages (particularly compared to expect) and take fractions of a second to run - so if a tested process (it's easy to test, too!) does fail, reversion is straightforward. Let's cover the REST methods before exploring the deeper stuff like gNMI or YANG.
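
As a rough illustration of the check, compare, and reverse pattern in the list above, here is a minimal sketch. Everything in it is hypothetical: `diff_state` and `run_change` are names made up for this example, and a plain Python list stands in for a device's routing table, where a real implementation would fetch state with API calls.

```python
from typing import Callable, Iterable


def diff_state(before: Iterable[str], after: Iterable[str]) -> dict:
    """Return entries that disappeared or appeared between two snapshots."""
    before_set, after_set = set(before), set(after)
    return {
        "missing": sorted(before_set - after_set),
        "added": sorted(after_set - before_set),
    }


def run_change(
    get_state: Callable[[], Iterable[str]],
    apply_change: Callable[[], None],
    revert_change: Callable[[], None],
    expected_added: Iterable[str] = (),
) -> dict:
    """Snapshot, apply, compare, and revert if the difference is not the planned one."""
    expected = set(expected_added)
    before = list(get_state())  # copy, so later mutation cannot alter the snapshot
    apply_change()
    after = list(get_state())
    delta = diff_state(before, after)
    unexpected = [entry for entry in delta["added"] if entry not in expected]
    not_delivered = sorted(expected - set(after))
    ok = not (delta["missing"] or unexpected or not_delivered)
    if not ok:
        revert_change()
    return {"ok": ok, "diff": delta}


# Toy example: a list stands in for the device's routing table (hypothetical data)
routes = ["10.0.0.0/24", "10.0.1.0/24"]


def bad_change() -> None:
    routes.append("10.0.2.0/24")  # the route we meant to add
    routes.remove("10.0.1.0/24")  # an unintended side effect


def undo() -> None:
    routes[:] = ["10.0.0.0/24", "10.0.1.0/24"]


result = run_change(lambda: routes, bad_change, undo, expected_added=["10.0.2.0/24"])
print(result)
```

The same comparison works for routing peers, MAC tables, or certificate stores, since anything that can be exported as a list of strings can be diffed this way.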

Anatomy of a REST call

When implementing a REST API call, a client request will have several key components:
  • Headers: Important metadata about your request goes here; the server should adhere to any specification provided in HTTP headers. If you're building API code, I'd recommend reviewing the list of supported fields and settling on a standard set. Examples:
    • Authentication Attributes
    • {'content-type': 'application/xml'}
    • {'content-type': 'application/json'}
    • {'Accept': 'application/json'}
    • {'Cache-Control': 'no-cache'}
  • Resource: This is specified by the Uniform Resource Identifier (URI), the part of the URL that follows the host. A resource is the "what" of a RESTful interaction.
  • Body: Free-form optional text, this component provides a payload for the API call. It's important to make sure that the server actually wants it!
  • Web Application Firewalls (WAF) can inspect header, verb, and body to determine if an API call is safe and proper. 
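
To make those request components concrete, here is a small sketch that assembles (but never sends) a request using only Python's standard library. `build_request` is a hypothetical helper for this example, the hostname is the lab system used in this post, and the header choices are my assumptions about a sensible JSON default rather than anything a particular API mandates.

```python
import json
import urllib.request


def build_request(base_url, resource, method="GET", body=None):
    """Assemble the verb, resource, headers, and optional body of a REST call."""
    headers = {"Accept": "application/json", "Cache-Control": "no-cache"}
    data = None
    if body is not None:
        # Only declare a content type when a payload is actually attached
        headers["Content-Type"] = "application/json"
        data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + resource,
        data=data,
        headers=headers,
        method=method,
    )


req = build_request("https://nsx.lab.engyak.net", "/api/v1/alarms")
print(req.get_method(), req.full_url)
print(req.header_items())
```

Nothing touches the network here, which makes this kind of builder easy to test before pointing it at real infrastructure.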
When implementing a REST API call, a server response will have several key components:
  • Headers: Identical use case, but keep in mind that headers from server to client will be following a different list.
  • Response Code: This should provide detail on the status of the API call.
    • In network automation, I strongly discourage simply trusting the response code as a means of testing for changes. It's better to make multiple GET requests to verify that the change was executed and provided the intended effects.
    • If you're implementing API-specific code, vendors will document what each error code means to them. A Python dictionary with numeric keys is a useful mechanism for mapping the vendor list, e.g.:
      • httperrors = {
        1: ('Unknown Command', 'The specific config or operational command is not recognized.'),
        2: ('Internal Error', 'Check with technical support when seeing these errors.'),
        3: ('Internal Error', 'Check with technical support when seeing these errors.'),
        4: ('Internal Error', 'Check with technical support when seeing these errors.'),
        5: ('Internal Error', 'Check with technical support when seeing these errors.')
        }
  • Body: Ideally used for additional detail on why the call finished with the status it did, but not mandatory.
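
A lookup against that kind of dictionary is worth wrapping so undocumented codes don't crash the script. A minimal sketch, assuming a table shaped like the one above (the two entries are copied from it; `describe_code` and the fallback text are hypothetical):

```python
# Two entries in the same shape as the vendor dictionary above
vendor_codes = {
    1: ("Unknown Command", "The specific config or operational command is not recognized."),
    2: ("Internal Error", "Check with technical support when seeing these errors."),
}


def describe_code(code: int) -> str:
    """Translate a vendor response code into a log-friendly line."""
    title, detail = vendor_codes.get(
        code,
        ("Undocumented Code", "Not in the vendor list; treat the change as unverified."),
    )
    return f"{code}: {title} - {detail}"


print(describe_code(1))
print(describe_code(99))
```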

    Verb

    In a REST API, it's important to specify the TYPE of change you intend to make prior to actually invoking it. F5 Administrators will be familiar with this, with actions like tmsh create. We have 4 major REST verbs:
    • Create
    • Read
    • Modify/Update
    • Delete
    When you use a particular transport, you need to implement these verbs in a way native to that transport. This is significant when using other remote command methods like SSH (tmsh does this), NETCONF, or RESTCONF, each of which implements them differently.

    Fortunately for us, HTTP/1.1 seems like it's been made for this! HTTP has plenty of verbs that match the above. Here's a brief decoder ring.
    • GET: READ-only request, typically does not include a message body. 
      • This will normally use a URI to specify what details you want to grab.
      • Since you're "getting" information here, typically you'd want to JSON pretty-print the output
    • POST: CREATE request, if you're making a new object on a remote system a message body is typically required and POST conveniently supports that.
      • POST should not overwrite existing data, but REST implementations vary!
    • POST: READ request, occasionally used when a query requires a message body. 
      • URIs don't always cut it when it comes to remote filtered requests or complex multi-tier queries.
      • Cisco NX-API avoids GET as a READ verb, and primarily uses POST instead with the REST verbs in the body
    • PUT: UPDATE request, is idempotent. Generally does not contain a lot of change safety, as it will implement or fully replace an object. 
      • Situations definitely exist that you want to be idempotent, and this is the verb for that.
      • Normally carries a body holding the full replacement object, although some APIs accept a PUT without one
    • PATCH: MODIFY request, will only modify an existing object.
      • This will take considerably more work to structure, as PATCH can optionally be safely executed, but the responsibility for assembling requests safely in this manner is on the developer.
      • Most API implementations simply use POST instead and implement change safety in the back-end.
    • DELETE: DELETE request, does exactly what it sounds like, it makes a resource disappear.
    Nota Bene: None of this is a mandatory convention, so vendors may implement deviations from the REST spec. For example, Palo Alto will use XML and 0-100 series HTTP codes.
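
The decoder ring above can be written down as a lookup table, which is handy when deciding whether a failed call is safe to retry. This is only a sketch of the conventional mapping: `HttpVerb`, `CRUD_TO_HTTP`, and `safe_to_retry` are hypothetical names, and as noted, any given vendor may deviate.

```python
from typing import NamedTuple


class HttpVerb(NamedTuple):
    method: str
    idempotent: bool  # repeating the same request leaves the same end state
    read_only: bool   # the request is not expected to change anything


# Conventional mapping only; check each vendor's documentation before relying on it
CRUD_TO_HTTP = {
    "create": HttpVerb("POST", idempotent=False, read_only=False),
    "read": HttpVerb("GET", idempotent=True, read_only=True),
    "update": HttpVerb("PUT", idempotent=True, read_only=False),
    "modify": HttpVerb("PATCH", idempotent=False, read_only=False),
    "delete": HttpVerb("DELETE", idempotent=True, read_only=False),
}


def safe_to_retry(operation: str) -> bool:
    """Only blindly retry operations that are conventionally idempotent."""
    return CRUD_TO_HTTP[operation].idempotent


for operation, verb in CRUD_TO_HTTP.items():
    print(f"{operation:<7} -> {verb.method:<6} retry-safe={safe_to_retry(operation)}")
```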

    Executing a REST Call

    Once the rules are set, executing a REST call is extremely easy. Here's an example:
    curl -k -u admin https://nsx.lab.engyak.net/api/v1/alarms
    Enter host password for user 'admin':
    {
    "results" : [ {
    "id" : "3e79618a-c89e-477b-8872-f4c87120585b",
    "feature_name" : "certificates",
    "event_type" : "certificate_expiration_approaching",
    "feature_display_name" : "Certificates",
    "event_type_display_name" : "Certificate Expiration Approaching",
    "summary" : "A certificate is approaching expiration.",
    "description" : "Certificate 5c9565d8-2cfa-4a28-86cc-e095acba5ba2 is approaching expiration.",
    "recommended_action" : "Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. For example, to apply a new certificate to the HTTP service, invoke the following NSX API POST /api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id> where <cert-id> is the ID of a valid certificate reported by the GET /api/v1/trust-management/certificates NSX API. Once the expiring certificate is no longer in use, it should be deleted by invoking the DELETE /api/v1/trust-management/certificates/5c9565d8-2cfa-4a28-86cc-e095acba5ba2 NSX API.",
    "node_id" : "37e90542-f8b8-136e-59bc-5dd3b79b122b",
    "node_resource_type" : "ClusterNodeConfig",
    "entity_id" : "5c9565d8-2cfa-4a28-86cc-e095acba5ba2",
    "last_reported_time" : 1637510695463,
    "status" : "OPEN",
    "severity" : "MEDIUM",
    "node_display_name" : "nsx",
    "node_ip_addresses" : [ "10.66.0.204" ],
    "reoccurrences_while_suppressed" : 0,
    "entity_resource_type" : "certificate_self_signed",
    "alarm_source_type" : "ENTITY_ID",
    "alarm_source" : [ "5c9565d8-2cfa-4a28-86cc-e095acba5ba2" ],
    "resource_type" : "Alarm",
    "display_name" : "3e79618a-c89e-477b-8872-f4c87120585b",
    "_create_user" : "system",
    "_create_time" : 1635035211215,
    "_last_modified_user" : "system",
    "_last_modified_time" : 1637510695464,
    "_system_owned" : false,
    "_protection" : "NOT_PROTECTED",
    "_revision" : 353
    }
    Now, saving cURL commands by hand gets administratively intensive quickly, so I recommend finding some method to save and automate custom API calls. Many of the more complex calls will require JSON payloads, variables, and the like.
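
As one possible way to save that call, here is a standard-library Python sketch of the same request plus a parser that boils the response down to the fields needed to plan a fix. `fetch_alarms` and `summarize_alarms` are hypothetical helpers, the field selection is my own, and the sample data is a trimmed copy of the response above so the parsing can be tried offline.

```python
import base64
import json
import ssl
import urllib.request


def fetch_alarms(host: str, user: str, password: str) -> str:
    """Python equivalent of the curl call above; returns the raw JSON text."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"https://{host}/api/v1/alarms",
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    # Mirrors curl -k: certificate verification is disabled, acceptable in a lab only
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=30) as response:
        return response.read().decode("utf-8")


def summarize_alarms(body: str) -> list:
    """Reduce the alarm list to open alarms and the fields needed to plan a fix."""
    wanted = ("id", "event_type", "severity", "status", "entity_id")
    return [
        {key: alarm.get(key) for key in wanted}
        for alarm in json.loads(body).get("results", [])
        if alarm.get("status") == "OPEN"
    ]


# Trimmed copy of the response shown above (not fetched live)
sample = json.dumps({"results": [{
    "id": "3e79618a-c89e-477b-8872-f4c87120585b",
    "event_type": "certificate_expiration_approaching",
    "severity": "MEDIUM",
    "status": "OPEN",
    "entity_id": "5c9565d8-2cfa-4a28-86cc-e095acba5ba2",
}]})
print(json.dumps(summarize_alarms(sample), indent=4))
```

Against a live system, `summarize_alarms(fetch_alarms("nsx.lab.engyak.net", user, password))` would be the equivalent of the curl command, with the credentials supplied however you normally handle secrets.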

    Executing a Procedure

    Planning the Procedure

    Here we'll use the API to resolve the certificate expiration alarm shown above. I'm going to use my own REST client, found here, because it's familiar. Let's write the desired result in pseudo-code first to develop a plan:

    • GET current cluster certificate ID
    • GET certificate store
    • PUT a replacement certificate with a new name
    • GET certificate store (validate PUT)
    • GET certificate ID (to further validate PUT). For idempotency, multiple runs should be supported.
    • POST update cluster certificate
    • GET current cluster certificate ID
    This process seems tedious, but computers don't ever get bored, and the objective here is to be more thorough than is reasonably feasible with manual execution! If you're thinking, "Gee, this is an awful lot of work!", trick rocks into doing it for you.

    Let's Trick Those Rocks

    Some general guidelines when scripting API calls:
    • Use a familiar language. An infrastructure engineer's goal with automation is reliability. Hiring trends and hipster cred don't matter here. If you do best with a slide rule, use that.
    • Use libraries. An infrastructure engineer's goal with automation is reliability. Leverage libraries with publicly available testing results.
    • Log and Report: An infrastructure engineer's goal with automation is reliability. Report every little thing your code does to your infrastructure, and test code thoroughly.
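
For the "Log and Report" guideline, one low-effort approach is a decorator that records every call a script makes. This is a sketch with hypothetical names (`logged`, `get_certificate_ids`) using only Python's standard `logging` module; the decorated function is a stand-in for a real API call.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("change_log")


def logged(func):
    """Log the start, outcome, and duration of every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.monotonic()
        # Never pass secrets as arguments to a logged function; they would be written out
        logger.info("START %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            logger.exception("FAIL %s after %.3fs", func.__name__, time.monotonic() - started)
            raise
        logger.info("OK %s after %.3fs", func.__name__, time.monotonic() - started)
        return result
    return wrapper


@logged
def get_certificate_ids(store: dict) -> list:
    """Stand-in for an API call: pull certificate IDs out of a parsed response."""
    return sorted(item["id"] for item in store["results"])


print(get_certificate_ids({"results": [{"id": "b"}, {"id": "a"}]}))
```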
    In this case, I published a wrapper for Python requests that allows me to save API settings here, and built a script on that library. Install it first:
    python3 -m pip install restify-ENGYAK

    From here, it's important to research the API calls required for this procedure (good thing we have the steps!). For NSX-T, the API Documentation is available here: https://developer.vmware.com/apis/1163/nsx-t

    NSX-T's Certificate management API also has a couple of quirks, where the Web UI and the API leverage different certificates. It's outlined here: https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.1/administration/GUID-50C36862-A29D-48FA-8CE7-697E64E10E37.html

    Since we're writing code for reliability

    I'd like to outline a rough idea of where my time investment was for this procedure. I hope it helps because the focus really isn't on writing code.
    • 50%: Testing and planning testing. I used Jenkins CI for this, and I'm not the most capable with it. This effort reduces over time, but its importance does not! Write your test cases before everything!
    • 30%: Research. Consulting the VMware API docs and official documentation was worth every yoctosecond - avoiding potential problems with planned work is critical (and there were some major caveats with the API implementation).
    • 10%: Updating the parent library, setting up the Python environment. Most of this work is 100% re-usable.
    • 5%: Managing source code, Git branching, basically generating a bread-crumb trail for the implementation for when I don't remember it.
    • 5%: Actually writing code!
    I'm saving useful API examples in my public repository: https://github.com/ngschmidt/python-restify
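
To show what "write your test cases first" can look like for this procedure, here is a small sketch using Python's built-in `unittest`. The helper `new_certificate_ids` is hypothetical; it isolates the "which certificate is new?" comparison so it can be tested without touching NSX-T at all.

```python
import unittest


def new_certificate_ids(old_ids, current_ids):
    """Return IDs present now that were absent before the upload, in sorted order."""
    return sorted(set(current_ids) - set(old_ids))


class NewCertificateTests(unittest.TestCase):
    def test_upload_adds_exactly_one_certificate(self):
        self.assertEqual(new_certificate_ids(["a", "b"], ["a", "b", "c"]), ["c"])

    def test_rerun_detects_no_change(self):
        self.assertEqual(new_certificate_ids(["a", "b"], ["b", "a"]), [])

    def test_order_of_the_store_does_not_matter(self):
        self.assertEqual(new_certificate_ids(["b", "a"], ["d", "a", "c", "b"]), ["c", "d"])


# Run the tests explicitly so the result object can be inspected afterwards
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NewCertificateTests)
outcome = unittest.TextTestRunner(verbosity=2).run(suite)
```

A CI system can run the same test file on every commit, so the comparison logic is proven before any change window opens.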

    The Code

    # JSON Parsing tool
    import json
    
    # Import Restify Library
    from restify.RuminatingCogitation import Reliquary
    
    # Import OS - let's use this for passwords and usernames
    # APIUSER = Username
    # APIPASS = Password
    import os
    
    api_user = os.getenv("APIUSER")
    api_pass = os.getenv("APIPASS")
    
    # Set the interface - apply from variables no matter what
    cogitation_interface = Reliquary(
        "settings.json", input_user=api_user, input_pass=api_pass
    )
    
    # Build Results Dictionary
    stack = {
        "old_cluster_certificate_id": False,
        "old_certificate_list": [],
        "upload_result": False,
        "new_certificate_id": False,
        "new_certificate_list": [],
        "new_cluster_certificate_id": False,
    }
    
    # GET current cluster certificate ID
    stack["old_cluster_certificate_id"] = json.loads(
        cogitation_interface.namshub("get_cluster_certificate_id")
    )["certificate_id"]
    
    # GET certificate store
    for i in json.loads(cogitation_interface.namshub("get_cluster_certificates"))[
        "results"
    ]:
        stack["old_certificate_list"].append(i["id"])
    # We need to compare lists, so let's sort it first
    stack["old_certificate_list"].sort()
    
    # PUT a replacement certificate with a new name
    print(cogitation_interface.namshub("put_certificate", namshub_variables="cert.json"))
    
    # GET certificate store (validate PUT)
    for i in json.loads(cogitation_interface.namshub("get_cluster_certificates"))[
        "results"
    ]:
        stack["new_certificate_list"].append(i["id"])
    # Keep only the IDs that were not present before the PUT, sorted so the pick below is deterministic
    stack["new_certificate_list"] = sorted(
        set(stack["new_certificate_list"]) - set(stack["old_certificate_list"])
    )
    
    # Be Idempotent - this may be run multiple times, and should handle it accordingly.
    if len(stack["new_certificate_list"]) == 0:
        stack["new_certificate_id"] = input(
            "Change not detected! Please select a certificate to replace with: "
        )
    else:
        stack["new_certificate_id"] = stack["new_certificate_list"][0]
    
    # GET certificate ID (to further validate PUT)
    print(
        cogitation_interface.namshub(
            "get_cluster_certificate",
            namshub_variables=json.dumps({"id": stack["new_certificate_id"]}),
        )
    )
    # POST update cluster certificate
    print(
        cogitation_interface.namshub(
            "post_cluster_certificate",
            namshub_variables=json.dumps({"id": stack["new_certificate_id"]}),
        )
    )
    print(
        cogitation_interface.namshub(
            "post_webui_certificate",
            namshub_variables=json.dumps({"id": stack["new_certificate_id"]}),
        )
    )
    
    # GET current cluster certificate ID
    stack["new_cluster_certificate_id"] = json.loads(
        cogitation_interface.namshub("get_cluster_certificate_id")
    )["certificate_id"]
    
    # Show the results
    print(json.dumps(stack, indent=4))
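
Printing the results still leaves a human to decide whether the change worked. A possible final step, sketched here with a hypothetical `validate_rotation` helper and made-up IDs, is to compare the collected values and report anything that doesn't line up:

```python
def validate_rotation(stack: dict) -> list:
    """Return a list of human-readable problems; an empty list means the change verified."""
    problems = []
    if not stack.get("new_certificate_id"):
        problems.append("No replacement certificate was identified.")
    if stack.get("new_cluster_certificate_id") != stack.get("new_certificate_id"):
        problems.append("The cluster is not using the replacement certificate.")
    return problems


# Hypothetical results, shaped like the stack dictionary the script prints
example = {
    "old_cluster_certificate_id": "old-id",
    "new_certificate_id": "new-id",
    "new_cluster_certificate_id": "new-id",
}
print(validate_rotation(example) or "Rotation verified")
```

Exiting with a non-zero status when the list is non-empty would let a CI job or scheduler flag the run as failed.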
    
