Monday, December 26, 2022

What happens to packets with a VMware vSphere Distributed Switch?

Distributed Virtual Port-Groups (dvPGs) in vSphere are a powerful tool for controlling network traffic behavior. vSphere Distributed Switches (vDS) are non-transitive Layer 2 proxies and provide us the ability to modify packets in-flight in a variety of complex ways.

Note: Cisco UCS implements something similar with their Fabric Interconnects, but software control of behavior is key here.

Where do the packets go?

Let's start with a packet flow diagram:

ESXi evaluates a combination of source and destination dvPG/MAC address conditions and will ship the packet to one of the following "stages":

  • vDS Memory Bus: This is only an option if the source and destination VM are both on the same port-group and the same host
  • vDS Uplink Stage: This is where the vSphere Distributed Switch receives the traffic from the vnic and applies any proxy settings
  • UCS FI: In Cisco UCS environments configured in end-host mode, traffic will depend on the vSphere Distributed Switch's uplink pinning, as Fabric Interconnects do not transit between redundant nodes. If they are configured in transitive mode, they function as external Layer 2 switches
  • External Switching: If the destination is in the same broadcast domain (determined by network/host bits) packets will flow via the access layer (or higher layers depending on the network design)
  • External Layer 3 Routing: Traffic outside of the broadcast domain is forwarded to the default gateway for handling
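
To make the Layer 2 versus Layer 3 decision concrete, here's a minimal Python sketch (the addresses are made up) of the network/host-bit comparison described above - if the destination falls inside the source's subnet, the traffic stays in the broadcast domain; otherwise it heads to the default gateway:

import ipaddress

# Illustrative only: the "same broadcast domain?" check boils down to
# comparing network bits. Addresses and prefix length are made up.
source = ipaddress.ip_interface("10.0.10.21/24")
destination = ipaddress.ip_address("10.0.10.45")

if destination in source.network:
    print("Same broadcast domain - stays at Layer 2 (external switching)")
else:
    print("Different network - forwarded to the default gateway (Layer 3)")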

Testing The Hypothesis

vDS tries to optimize VM-to-VM traffic to the shortest possible path. If a Virtual Machine attempts to reach another Virtual Machine on the same host, and same dvPG, ESXi will open up a path via the host's local memory bus to transfer the Ethernet traffic. 

This hypothesis is verifiable by creating two virtual machines on the same port-group. If the machines in question are on the same host, traffic will flow even if the VLAN in question isn't trunked to the host - an important thing to keep in mind when troubleshooting.

An easy method to test the hypothesis is to start an iperf session between two VMs and change the layout accordingly. The bandwidth available will often differ between the memory-bus path and the provisioned network adapters.

For this example, we will execute an iPerf TCP test with default settings between 2 VMs on the same port-group, then vMotion the server to another host and repeat the test.
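
To avoid copying numbers by hand, a small Python wrapper around the iperf client can run the test and pull out the reported bandwidth. This is just a convenience sketch - it assumes iperf (version 2) is installed on both VMs, a listener is already running with iperf -s on the target, and the address below is a placeholder:

import subprocess

# Hypothetical helper: run the iperf client and return the last bandwidth
# figure from its output (e.g. "6.00 Gbits/sec").
def run_iperf(server_ip: str) -> str:
    result = subprocess.run(["iperf", "-c", server_ip],
                            capture_output=True, text=True, check=True)
    for line in reversed(result.stdout.splitlines()):
        if "bits/sec" in line:
            return " ".join(line.split()[-2:])
    return "no bandwidth line found"

print("Measured bandwidth:", run_iperf("192.0.2.10"))  # placeholder address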

  • Output (Same Host, same dvPG):
    root@host:~# iperf -c IP
    ------------------------------------------------------------
    Client connecting to IP, TCP port 5001
    TCP window size:  357 KByte (default)
    ------------------------------------------------------------
    [  3] local IP port 33260 connected with IP port 5001
    [ ID] Interval       Transfer     Bandwidth
    [  3] 0.0000-10.0013 sec  6.98 GBytes  6.00 Gbits/sec
        
  • Output (Different host, same dvPG):
    root@host:~# iperf -c IP
    ------------------------------------------------------------
    Client connecting to IP, TCP port 5001
    TCP window size: 85.0 KByte (default)
    ------------------------------------------------------------
    [  3] local IP port 59478 connected with IP port 5001
    [ ID] Interval       Transfer     Bandwidth
    [  3] 0.0000-10.0009 sec  10.5 GBytes  9.06 Gbits/sec
      

Troubleshooting

Understanding vSphere Distributed Switch packet flow is key when assessing networking issues. The shared memory bus provided by ESXi is a powerful tool, ensuring that short (if sometimes less consistent) paths are used over longer, more consistent ones.

When constructing the VMware Validated Design (VVD) and the system defaults that ship with ESXi, the system architects chose a "one size fits most" strategy. This network behavior would be desirable with Gigabit data centers or edge PoPs, or anywhere the network speed would be less than the memory bus. In most server hardware, system memory buses will exceed backend network adapters' capacity, improving performance with small clusters. It's important to realize that VMware doesn't just sell software to large enterprises - cheaper, smaller deployments make up the majority of customers.

Impacts on Design

Shorter paths are not always desirable. In my lab, hardware offloads like TCP Segmentation Offload (TSO) are available and make outbound traffic more performant. Newer hardware architectures, particularly 100GbE (802.3by), benefit from relying on the network adapter for encapsulation/decapsulation work instead of consuming CPU resources better allocated to VMs.

This particular "feature" is straightforward to design around. The vSphere Distributed Switch provides us the requisite tools to achieve our aims, and we can follow several paths to control behavior to match design:

  • When engineering for high performance/network offload, creating multiple port-groups with tunable parameters is something a VI administrator should be comfortable doing. Automating port-group deployment and managing the configuration as code is even better.
  • If necessary, consider SR-IOV for truly performance intensive workloads.
  • The default is still good for 9 of 10 use cases. Design complexity doesn't make a VI administrator "leet"; consider any deviation from the recommended defaults carefully.

As always, it's important to know oneself (your environment) before making any design decision. Few localized Virtual Machines concentrate enough traffic to benefit from additional tuning. Real-world performance testing will indicate when these design practices are necessary.

Saturday, December 17, 2022

Security patches are available for VMware vCenter 8.0 - Let's try the new vCenter Lifecycle Manager!

Let's take a look at the new lifecycle management process for vCenter.

The old process via the VAMI was easy to execute, but the industry is upping the ante with automated pre- and post-testing. Cisco's NX-OS installer is another example - complex procedures (in Cisco's case, sequential FPGA or microcode updates) invite problems and escalate a "simple" process into something only the senior-most engineer can safely operate.

vSphere 8 seeks to improve on vSphere 7's upgrade planner by including a "vCenter Lifecycle Manager" to administer package upgrades in an integrated, reliable fashion that includes available pre-checks and reduces "update anxiety".

To start, navigate to Updates -> vCenter Server -> Update:

Under Select Version, it's now possible to view eligible updates, along with the type and Release Notes:

For those of us who use NSX Data Center or other integrated products, interoperability checks are part of the wizard.

Unfortunately, the vCenter backup step is not included as part of the wizard at this time. (Note: You can back up directly to a filer with vSphere 7 or newer)

Looks like we're not quite ready to use this feature to its fullest potential yet. Some notable limitations still exist and should be compensated for.

Establishing processes and automatic patch notifications (RSS is a powerful tool for that) will go a long way toward making a New Year's resolution to keep our systems up to date!

Saturday, December 3, 2022

Why Automate? Programmability is about solving new problems without fear of failure.

Have you ever heard someone say "I'm not a coder" at work?

The IT industry is changing again. Our humble origins began as polymaths moved from adjacent industries and created a new world from scratch. The pioneering phase led to unique opportunities, creating our transport protocols, programming languages, and ways of building.

The appetite for trying new things is fundamentally different now. We don't worry as much about functional quality with our IT products in this day and age. Even Windows, the butt of jokes in the 2000s, provides a consistent and reliable user experience.

Is this the denouement for IT innovation? Neal Stephenson predicted this issue in 2011, examining the creativity and breakneck pace the aerospace industry developed in the 1960s/1970s.

More importantly, he brings to light a painful pattern that IT engineers often go through when trying to create new things for their company's goals - "done before" means something isn't worth doing. Don't we all buy products from (more or less) the same companies to achieve similar outcomes? Why should we care if an idea was executed before?

Liability for failure and high expectations, both in quality and in reduced risk, are prolific in today's market. I'd argue that we have a new problem - after a decade or so of easy-to-implement, highly reliable products, we've forgotten what it feels like to try something new. When infrastructure engineers want to attempt something novel, we're told it costs too much or might hurt the business, and removing something costly is too much of a problem.

The software development market has this figured out. The shift to artistic creativity has provided some growing pains, but we see a potential bridge to the future here. Infrastructure engineers may not "be coders", but uncertain outcomes are what engineers (pragmatic creatives, not artistic creatives) excel at. Our analogs in industry, actual engineers, incrementally improve a physical resource, creating safer cars or city designs that promote creative growth. They don't need to worry about whether the low-level components function.

IT is maturing, and our goals are changing, but we can't forget where we came from. Software Development is bifurcating from IT infrastructure. The internal focus for infrastructure is shifting to providing tools and resources to developers as typical customers. We need to find strength in our pragmatic creativity.

Through rose-tinted glasses, "melting pot" innovation imparts a culture of "can-do" wherever success lives - but the transition to disposable electronics and mechanical products is removing opportunities to develop the required skills. Deep down we know this is bad, and those who can are setting the key themes - "maker spaces" and "hackerspaces" are good examples of the trend. We need to teach new engineers not to fear failure or the practice of trying.

This doesn't mean that we can throw caution to the wind. While I admire a farmer's ability to innovate at work, creating some trailblazing (albeit somewhat unsafe) fixes in the field is not what we need in IT. (Check out FarmCraft101 for some of the stuff they do)

We need to change how the IT infrastructure industry operates. Educating new engineers will always involve at least a little trial and error, and the most important thing we can do is create an environment that balances the values of trailblazing and reliable delivery. Programmability does this for us, but we don't know how to use it fully yet - we can look at other engineering industries to see what might work for us.

What Works

IT Engineers all seem to agree that setting up "labs" facilitates healthy innovation. Home labs offer an environment where an individual can break and rebuild, with an onus to fix it. Resources like VMware's Hands-on-Labs provide a zero-cost method to learn without consequence, albeit with product marketing.

Dyed-in-the-wool engineers love testing. New engineers learn by examining failure without negative context; engineers may work for years before actually building anything themselves. The search term "failure analysis" provides a wealth of information on the processes used by pragmatically creative individuals to steadily improve modular designs to the point where they achieve an artistically creative outcome.

Continuous Delivery practices (DevOps) supercharge this approach; we don't have to deal with physicality. If we manage to automate testing, it costs us virtually zero time, and we can pick up the failure modes as educational resources for new engineers.

How We Can Change

I'd like to see a practice that is two-thirds engineering and one-third "redneck farm repairs". With the FarmCraft101 example, we see an admirable attitude instead of apprehension towards trying new things, and we need to combine it with mature, reliable practices.

The combination of removing or reducing the cost of failure and a drive to try new things is about to reach a critical point in the IT industry. We're seeing waves of new engineers born in the 2000s enter the industry, and they don't remember setting IRQs, flipping DIP switches with the right type of ballpoint pen, or working around 32-bit memory ceilings. Personal Computers have become throwaway devices that we don't have to understand well to use, and we need ways to preserve the resilience that comes with "I can solve any problem that comes my way". Raspberry Pi, Arduino, and their lookalikes revitalize this mindset and provide a quality of education that we wish we had when we were young - let's make sure the younglings use it.

I'd also like to see some self-awareness. Most people who "can't code" are in the same boat as those who "can't write", they just don't feel artistically creative. Pragmatic creativity is the backbone of modern engineering - a concept artist doesn't design a car or make it real beyond visual aesthetic and non-functional requirements. The inability to write creatively or "code" is fixed first by identifying a useful goal and achieving it. Infrastructure engineers already do this - just look at how network engineers make long-haul connectivity meet a business objective, or how HTTP forwarding rules make an application behave better.

Let's remove the belief that coding is impossible - most of the truly "propeller-hat" stuff has been done by vendors and community members already - which leaves most actual software development as "exchanging text files over a network" or "making deterministic paths for behavior". Object-Oriented Programming and other best practices are emphasized so that you understand why a given structure is there. Understanding how an automatic transmission works, and how it should be used, helps a driver improve their skills, but the depth to which we learn varies from person to person, and a deep understanding isn't always necessary.

Know your strengths, and know that you can meaningfully contribute by using them. It might take a lifetime to figure out how, but it's worth it.

Sunday, November 6, 2022

Tweak Application behaviors with NSX ALB

Not every application is designed to leverage a load balancer.

Early load balancing solutions capitalized on this fact - which may seem odd. The solution to the supportability issue creates new opportunity; we'll look at some examples here. Adjusting application behaviors quickly becomes a powerful tool when enhancing customer experience.

This was the story before digital transformation patterns began to emerge, at least. Businesses now prefer to use Commercial Off-The-Shelf (COTS) software and Content Management Systems (CMS) instead of writing raw code or web frameworks. This means that the software our SWEs are using is considerably less "tunable" than it used to be, and that the business line expects victory when adopting newer software.

They aren't wrong - the advantage at the other end of this transition is more maintainable software. It's important to be accommodating and compassionate when helping web developers improve their app - they didn't write it, and it's a new approach for them too.

Knowledge of how HTTP works is valuable here. For this exercise, I'm going to proxy Jenkins - it's not designed for a load balancer and is open source.

It's really important to standardize behavior with load balancers. Any profile or rule should be captured either in source control or a manager of managers. I prefer both, because tools like Git pull requests provide a valuable place for peer review and for seeking input on standards. It's good to have standards; it's better if all qualified engineers agree on the standard (and there's a way to version and release new ones!)

We'll be using the following profiles for this application as a starting point:

  • http-profile-Jenkins-v1.0.0
    • This profile looks comprehensive - Avi Vantage stores all variables explicitly when anything is customized. Here are the highlights:
      • Enable HSTS: Not required, I just like it. NSX ALB/Avi doesn't configure a separate service for port 80, so this ensures we don't run cleartext traffic.
      • Reset HTTP on HTTPS connections: This is an Avi oddity - by default, HTTPS services will return an HTTP 400 series code instead of resetting the connection. I don't like giving out thumbprint data on a given port when it asks for illegal output.
      • X-Forwarded-For: Avi will include an HTTP Header with this label and the client's IP address. This allows a web engineer to configure logging with the real address in 1-arm deployment scenarios
  • clienttls-v1.0.0
    • This client TLS profile is short because we can be strict on security (the sketch after this list expresses the same constraints in code).
      • Include a short acceptable cipher suite/cipher list, and enforce forward secrecy
      • Only support TLS 1.2 and 1.3
      • Prefer server cipher ordering (enforce good cryptography)
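
None of this is Avi-specific. As a sanity check, the same constraints can be expressed with Python's standard ssl module - purely illustrative, not how NSX ALB/Avi stores the profile:

import ssl

# Roughly the intent of clienttls-v1.0.0, expressed as a server-side
# SSLContext for illustration only.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.minimum_version = ssl.TLSVersion.TLSv1_2    # only TLS 1.2 and 1.3
ctx.maximum_version = ssl.TLSVersion.TLSv1_3
ctx.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")  # short list, forward secrecy
ctx.options |= ssl.OP_CIPHER_SERVER_PREFERENCE  # prefer server cipher ordering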

We've already covered significant ground in terms of standardization and user experience here. Many web apps don't plan to implement user-friendly TLS, for example, and benefit from NSX ALB/Avi in front of their web tier, acting as a security guard until the client successfully negotiates TLS and navigates to the appropriate port. Many offerings also layer WAF and DDoS protection into their products for this very reason.

Let's configure a basic virtual service:

This basic wizard will establish some acceptable defaults. Let's modify the virtual service and add our settings:

  • Application Profile
  • TLS Profile
  • Certificate

Let's also configure this service to listen on port 80.

Note: Avi will forward unencrypted traffic to the pool, so some loss of privacy may occur if you don't configure the redirect rule under Profiles -> HTTP Request prior to saving!

The new service should be healthy under Application -> Dashboard:

Let's try and use the app!

We encounter an error when clicking "Manage Jenkins":


This is the configured name of my Jenkins server - and it's no longer using the virtual server's name.

We encounter this issue commonly with CMS, and "hard linked" applications can prove difficult to resolve. There are two approaches to resolving this issue:

  • Leverage redirects, either:
    • Rewriting any redirects that flow through the load balancer
    • Redirecting from port 8080 to the correct service
  • Rewrite HTML as it streams through the load balancer
    • Avoid this if possible - it's computationally very costly and may not support production traffic levels. It's still useful while developing a new website, as it might be a quicker hypothesis test than a coded solution or a vendor request for change (RFC)

Before diving into a specific fix, it's quite easy to categorize the problem with Firefox. Press F12 on your keyboard to bring up the developer view, and select the Network tab:

Note: This is a big part of why web developers always think it's the network that broke their application.

Click the link again, and you will see a play-by-play of the HTTP transactions executed to complete it:

That's a bingo. Jenkins is forwarding a non-relative HTTP 302 (also known as the "your link is in another castle" HTTP code). Let's rewrite it with NSX ALB by creating an HTTP Response Policy.

Here's a quick cheat sheet:

  • HTTP Response policies let an engineer edit HTTP transactions (but not the body) from the server to the client
  • HTTP Request policies let an engineer edit HTTP transactions (but not the body) from the client to the server

The `Location` header in an HTTP transaction applies specifically to redirects (301,302) - and as such is easy for a load balancer to rewrite. In this case, it wasn't the HTML pages that had a bad link - the redirect chain `/manage->/manage/` was the culprit.
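
The rewrite itself is simple enough to express in a few lines of Python. This is not how Avi implements its response policy - it's just a sketch of the transformation, with a made-up backend name and virtual service hostname:

from urllib.parse import urlsplit, urlunsplit

# Sketch of a Location-header rewrite: swap the backend's host:port for the
# hostname the client actually connected to. All names are hypothetical.
def rewrite_location(location: str, vs_hostname: str) -> str:
    parts = urlsplit(location)
    if parts.netloc and parts.netloc != vs_hostname:
        parts = parts._replace(scheme="https", netloc=vs_hostname)
    return urlunsplit(parts)

print(rewrite_location("http://jenkins-backend:8080/manage/", "jenkins.example.com"))
# -> https://jenkins.example.com/manage/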

VMware NSX ALB (and any other self-respecting ADC) presents an infrastructure engineer with near unlimited power to ensure that an application is delivered correctly (even if there's a flaw in the product). Check out the Developer Screen's network tab (F12) in Firefox/Chrome on some common websites - you will find that many of them employ similar "fixer" tactics.

Sunday, October 23, 2022

Track Certificate Expiration with Jenkins and Python 3!

CI/CD tools aren't just for automatically deploying apps! Jenkins excels at enabling an engineer to automatically execute and test code - but it has a hidden super-power: automating boring and intensive IT tasks (removing toil).

Let's take a common and relatable IT problem - it doesn't matter if you're a DevOps engineer, an Agilista, or even a "normal" systems engineer. Tracking certificate expiration is not an enjoyable task, and it often involves either manual checking or (usually) an outage to discover that a certificate has expired.

This solution will have several major elements:

  • An inventory of TLS-issued hosts
  • A Python 3 script (leveraging OpenSSL) to open up TLS connections and fetch certificates
  • A Jenkins pipeline to execute that script against that inventory daily, emailing the results

Inventory

Full transparency, this example is executed in a home lab. It's naive to assume that this task is trivial for any enterprise, but here are some potential approaches to building an inventory at scale:

  • Write a Python script to ingest DNS zone files and loop through the names to see if any listen on port 443 (a rough sketch follows this list)
  • Fetch a report from a vulnerability scanner (Retina, Qualys, Nexpose)
  • Search PKI issuance reports (if available)
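
Here's what the first approach might look like as a rough sketch - the zone file path and domain are placeholders, and real zone files will need more careful parsing than this:

import socket

# Pull A-record names out of a BIND-style zone file and test whether each
# answers on TCP 443. Zone file path and domain suffix are made up.
def hosts_from_zone(path: str, domain: str) -> list:
    hosts = []
    with open(path) as zone:
        for line in zone:
            fields = line.split()
            if len(fields) >= 4 and fields[-2] == "A":
                name = fields[0]
                hosts.append(name.rstrip(".") if name.endswith(".")
                             else f"{name}.{domain}")
    return hosts

def listens_on_443(fqdn: str) -> bool:
    try:
        with socket.create_connection((fqdn, 443), timeout=3):
            return True
    except OSError:
        return False

for name in hosts_from_zone("db.engyak.co", "engyak.co"):
    if listens_on_443(name):
        print(f'{{ "fqdn": "{name}", "port": 443 }}')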

We also want to write our inventory file in a way that's friendly to our execution approach. Python is dynamically typed, and most IT automation is fine with that - we're not doing any hardcore programming for most of it. The vast majority of IT automation involves sending and processing files and I/O.

Python will change a variable to any data type when you tell it to, so it's useful to map out what we want. Here are the relevant data types. I will also include the symbols JSON uses to signify them (if applicable):

  • String (""): This is a type that encapsulates a series of text characters
  • Integer (No wrapping): A whole number, signed. It can be positive or negative, but there's no decimal point (decimal points are their own unique flavor of complexity in computer programming)
  • List ([]): To a dyed-in-the-wool software developer, this will be similar to an array. Lists are indexed by an integer and can contain any of the other data types listed here
    • Python has a neat trick where a for loop can return a list item instead of the index, which saves a great deal of code
    • Python can sort a list by executing the function .sort() on that object
  • Dictionary ({}): This is an advanced construct, and provides an engineer with a great deal of capability (at the expense of performance, and code simplicity in some cases)
    • dicts store entries as key-value pairs, and the index is usually a string
    • Python can add to a dictionary by adding a new key, e.g. dictionary["newkey"] = "b33f"
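
Those behaviors look like this in practice - nothing project-specific here, just the base language features described above with throwaway data:

ports = [8443, 443, 22]
ports.sort()                   # in-place sort -> [22, 443, 8443]
for port in ports:             # the for loop yields items, not indexes
    print(port)

dictionary = {"fqdn": "vcenter.engyak.co"}
dictionary["newkey"] = "b33f"  # adding a new key creates the entry
print(dictionary)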

When planning software functionality, we want to always use the right tool for the job. Downstream APIs (e.g. OpenSSL) want to see a particular format for a parameter (e.g. a TCP port should be an integer), so documentation research is a must at this phase. I'll explain my logic for this file:

  • I want to easily iterate through the list, without addressing indexes, and I want it to be fast. I should use a list for the top-level data in the inventory ([])
  • I want to ensure that I don't accidentally address the keys wrong, so each individual entry should be a dictionary ({}) with the following keys: fqdn, port
    • fqdn should store a string
    • port should store an integer

Example:

[ { "fqdn": "vcenter.engyak.co", "port": 443 }, { "fqdn": "nsx.engyak.co", "port": 443 } ]

Python Code

Here's a copy of my code. To execute it, the following pip packages need to be installed:

  • fqdn
  • pyOpenSSL (imported as OpenSSL)
  • ruamel.yaml

datetime in particular does quite a bit of heavy lifting here. The package maintainers in the Python community have already solved most of the truly difficult problems, so interpreting expiration dates becomes a simple comparison operation.

I rely heavily on functions to keep this code maintainable - it's only 167 lines long, and much of that length is there for readability.

Another point of note - when writing Python to execute in a pipeline, it helps to be Perl levels of dramatic when crashing code. Jenkins doesn't evaluate output by default, and the easiest way to signal a problem is with sys.exit(""). This is why I placed a crash at the end of the run if any errors exist.
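
My full script does more than this, but the core of the check fits in a short sketch: open a TLS connection, parse the certificate's notAfter field, and let datetime do the comparison. The hostname and the 30-day threshold are placeholders, and error handling is deliberately minimal:

import datetime
import ssl
import sys

from OpenSSL import crypto

# Sketch of the core check, not the full script: fetch a certificate,
# parse notAfter, and compare against the current date.
def days_until_expiry(fqdn: str, port: int) -> int:
    pem = ssl.get_server_certificate((fqdn, port))
    cert = crypto.load_certificate(crypto.FILETYPE_PEM, pem.encode("ascii"))
    not_after = datetime.datetime.strptime(
        cert.get_notAfter().decode("ascii"), "%Y%m%d%H%M%SZ")
    return (not_after - datetime.datetime.utcnow()).days

errors = []
for target in [{"fqdn": "vcenter.engyak.co", "port": 443}]:  # placeholder inventory
    remaining = days_until_expiry(target["fqdn"], target["port"])
    if remaining < 30:
        errors.append(f"{target['fqdn']}: {remaining} days left")

if errors:
    sys.exit("Expiring certificates found: " + ", ".join(errors))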

Jenkins

This configuration example should provide some basic level of functionality. Jenkins has a lot of capability, so this tooling can be endlessly tweaked to your needs.

First, let's set up an SMTP server. With a default installation, the settings are under Dashboard -> Manage Jenkins -> Configure System:

Advanced Settings will allow you to configure SMTP auth and port, if applicable. If you use Gmail, you can use app passwords, preserving MFA and avoiding password proliferation.

Now, let's set up a freestyle pipeline:

With Jenkins, all things should be executed from source code. This is the way.

We want to run this daily, irrespective of source code changes. This requires a slight deviation from the usual Poll SCM approach:

As always, amnesic workspaces are best:

Python and the inventory file simplify the Jenkins configuration as well. Just execute the script as-is:

The final step is to add a Post-Build Action to email if there is a failure:

It really is this simple. Jenkins will now execute daily and email you a list of expired and soon-to-expire certificates!

Lessons Learned

I'm going to improve this code. Here are some of my ideas:

I'm continually amazed at what the open source community can achieve with this level of simplicity. Would you consider this approach out of reach or too challenging?

Saturday, October 15, 2022

Gathering and Using Data from Cisco NX-OS with Ansible Modules

Reliably executing repetitive tasks with automation is easy (after the work is done).

Given enough work, self-built automation can be easy to consume. Non-consumers (engineers) need to focus on reliability and repeatability, but occasionally there's an opportunity to save time and simplify lives directly.

Information gathering with Ansible is a powerful tool, making the effort to perform a check on one network node roughly equal to the effort on two, or even one hundred. Here's a quick and easy way to get started.

Ansible Inventory

Ansible likes to know where each managed node lives, and provides the inventory capability to organize similar devices for remote management. Not all network automation endpoints use the inventory feature, so ensure that you read the published documentation first. 

Note: The easiest way to check inventory dependency is to verify if there are directives in the playbook named hostname, username, or password. If they exist, that module probably does not use inventory.

Ansible supports two formats for an on-controller inventory: INI (Windows conf-style) and YAML (Linux-style). Here's an example in YAML - I personally find it easier to read:

---
  nxos_example_001:
    hosts:
      nexus_1:
        ansible_host: "1.1.1.1"
      nexus_2:
        ansible_host: "2.2.2.2"
    vars:
      ansible_user: "admin"
  nxos_all:
    children:
      nxos_example_001:

We have a little bit to unpack here:

  • The first hierarchical tier is for groups, which can contain other groups if you use the children: directive (see nxos_all as an example)
  • vars: will specify variables to commonly use across all members of that group
  • ansible_host is used to specify an address - and is useful with dual stack environments (or ones that don't have DNS)

Ansible Facts

Ansible stores all of its runtime variables for a given playbook as facts. This is held as a Python dict at runtime by Ansible Engine, and the debug: module allows an engineer to print the output to stdout:

---
- hosts: localhost
  connection: local
  tasks:
    - name: "Print it!"
      debug:
        var: lookup('ansible.builtin.env', 'PATH')
    - name: "Print it, but with msg!"
      debug:
        msg:
          - "The system environment PATH is: {{ lookup('ansible.builtin.env', 'PATH') }}"
          - "Wise engineers don't use this feature to print passwords"

Running this playbook will produce the following:

ansible-playbook debug.yml 
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'

PLAY [localhost] *************************************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *******************************************************************************************************************************************************************************************************************************************************************
ok: [localhost]

TASK [Print it!] *************************************************************************************************************************************************************************************************************************************************************************
ok: [localhost] => {
    "lookup('ansible.builtin.env', 'PATH')": "/root/.vscode-server/bin/d045a5eda657f4d7b676dedbfa7aab8207f8a075/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
}

TASK [Print it, but with msg!] ***********************************************************************************************************************************************************************************************************************************************************
ok: [localhost] => {
    "msg": [
        "The system environment PATH is: /root/.vscode-server/bin/d045a5eda657f4d7b676dedbfa7aab8207f8a075/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
        "Wise engineers don't use this feature to print passwords"
    ]
}

PLAY RECAP *******************************************************************************************************************************************************************************************************************************************************************************
localhost                  : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0 

msg: is effective for formatted output, while var: is considerably simpler when dumping a large dictionary. var: does not require Jinja formatting, which can keep playbooks simpler.

Let's apply this to a Cisco NX-OS Node. We can register command output from the nxos_facts module.

Note: The example provided below is the "new way", where Network modules follow the Ansible rules. If using older versions of Ansible (Ansible 2), the following may not be fully available!

First, we need to update the Ansible inventory. We will be using the API method to collect data, and it requires multiple new variables:

  • ansible_network_os: Instructs Ansible on what module to use for that system
  • ansible_connection: Instructs Ansible on what transport to use (HTTP API, SSH)
  • ansible_httpapi_use_ssl: Instructs Ansible to use HTTPS
---
  nxos_example_001:
    hosts:
      nexus_1:
        ansible_host: "1.1.1.1"
      nexus_2:
        ansible_host: "2.2.2.2"
    vars:
      ansible_user: "admin"
      ansible_network_os: 'cisco.nxos.nxos'
      ansible_connection: ansible.netcommon.httpapi
      ansible_httpapi_password: ''
      ansible_httpapi_use_ssl: 'yes'
      ansible_httpapi_validate_certs: 'no'
  nxos_all:
    children:
      nxos_example_001:

The updated inventory allows us to run extremely simple playbooks to gather data:

---
- hosts: nxos_machines
  tasks:
    - name: "Gather facts via NXAPI"
      cisco.nxos.nxos_facts:
        gather_subset: 'min'
        gather_network_resources:
          - 'interfaces'
      register: nxos_facts_gathered
    - name: "Print it!"
      debug:
        var: nxos_facts_gathered

ansible-playbook debug_nxos_facts.yml

PLAY [nxos_machines] *************************************************************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ***********************************************************************************************************************************************************************************************************************************************************************************************
[WARNING]: Ignoring timeout(10) for cisco.nxos.nxos_facts
ok: [nx-1]

TASK [Gather facts via NXAPI] ****************************************************************************************************************************************************************************************************************************************************************************************
ok: [nx-1]

TASK [Print it!] *****************************************************************************************************************************************************************************************************************************************************************************************************
ok: [nx-1] => {
    "nxos_facts_gathered": {
        "ansible_facts": {
            "ansible_net_api": "nxapi",
            "ansible_net_gather_network_resources": [
                "interfaces"
            ],
            "ansible_net_gather_subset": [
                "default"
            ],
            "ansible_net_hostname": "AnsLabN9k-1",
            "ansible_net_image": "bootflash:///nxos.9.3.8.bin",
            "ansible_net_license_hostid": "",
            "ansible_net_model": "Nexus9000 C9300v",
            "ansible_net_platform": "N9K-C9300v",
            "ansible_net_python_version": "3.9.2",
            "ansible_net_serialnum": "",
            "ansible_net_system": "nxos",
            "ansible_net_version": "9.3(8)",
            "ansible_network_resources": {
                "interfaces": [
                    {
                        "name": "Ethernet1/1"
                    },
                    {
                        "name": "mgmt0"
                    }
                ]
            }
        },
        "changed": false,
        "failed": false
    }
}

PLAY RECAP ***********************************************************************************************************************************************************************************************************************************************************************************************************
nx-1                       : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Ansible's Inventory feature enables us to scale per node without any additional code - the previous playbook will execute once on every inventory object in the group, which allows an engineer to thoroughly test a playbook on lab resources with some level of separation.

Deliberate automation design will bear fruit here, as safety is key when developing and testing automation. As with previous automation-centric posts, thorough testing of automation for reliability is a social responsibility when creating tools.

Establishing a separate CI/CD tooling set to target a lab (or CML, as in this case!) enables us to add additional safeguards against accidental changes, such as ACLs/Firewall policies preventing access from Test CI/CD -> Production network assets. Tools like CML take it even further by allowing an engineer to spin up amnesic NOS instances to run code against.

Here's an applicable instance. Recently, Cisco disclosed a vulnerability with Cisco Fabric Services - and most environments don't need that service running. This is an aggressive fix - but with Ansible we can check for the service and disable it only if it's running, and then check again afterwards. This illustrates the value of idempotency, or the practice of running repeated executions safely.

Friday, September 23, 2022

Using cloud-init with vSphere and openSUSE 15.4

Rapidly deploying Linux servers on a whim is the essence of home lab activity, but we spend a great deal of time spinning up and configuring machines to meet our specs.

Worse, we lose a great deal of time keeping them properly configured and up to date, and none have the privilege of unlimited lab time.

Let's explore a way to get a base template implemented in vSphere 7 and enable the machine to boot with customizations like hostname, IP address, startup scripts, etc.

Constructing a VM Template

First, let's pick up a fresh operating system installer ISO from opensuse.org. Since this is a home lab / server-style deployment, I'd recommend using the network image - we'll add everything we want later.

Upload the ISO file to a datastore. This step will allow the installation process to run unattended, even if you shut down the client:


Create a virtual machine, and name it accordingly. Attach the datastore ISO:

Boot the Linux machine. During the installation wizard, ensure that a logical volume manager (LVM2) is used. I've found that when you build a clone template, any disk size you choose will be wrong in the application owner's mind, so plan for the future.

After the installation is complete, disconnect the CD/DVD virtual drive! If you fail to do this on shared infrastructure, the VI admins will have a difficult time with the VM - and in a home lab, that's you. Establish good habits to make responsible tenancy easy.

Start up the machine, and use zypper to install any packages or keys that may be required to administer the device. In a home lab, CA certificates and SSH keys are OK - but an enterprise environment should have an automated, repeatable way to lifecycle trust in the event of a compromise.

Once that's done, let's install cloud-init. This software package is incredibly useful, but it isn't available by default with OpenSUSE Leap:

After installing the package, ensure it's enabled with:

systemctl enable cloud-init
systemctl enable cloud-init-local
systemctl enable cloud-config
systemctl enable cloud-final
cloud-init clean

Cloud-Init

Cloud-init is a project managed by Canonical to standardize VM customization on boot, making IaaS more "cloudy", regardless of hosted location. It is structured to receive configuration data from a datasource and abstracts the specific inputs from other "clouds" to the IaaS workload (VM) as consistent instructions. The customization software will use these data sources as "drop points" to transform the cloud-specific instructions (OVF, Azure, EC2) to a common configuration (Metadata, Userdata).

metadata should represent the workload's system configuration, like hostname, network configuration, and mounts.

userdata should represent the workload's user space configuration, like Ansible playbooks, SSH keys, and first-run scripts. With the current state, I would tend towards using automation to register a workload with Ansible and perform that configuration centrally. It's neat that this level of customization is offered, though - cloud-init can automatically register with centralized orchestrators like SaltStack and Puppet on startup.

cloud-init has a ton of goodness available as boot-time customization, and this will only scratch the surface of how it can be used. cloud-init accepts a YAML configuration that can include:

  • Users/Groups
  • CA certificates
  • SSH keys
  • Hostnames
  • Packages/Repositories
  • Ansible Playbooks
  • External mounts (NFS)

VMware offers two data sources for workloads provisioned on vSphere: the traditional OVF data source and the newer VMware (guestinfo-based) data source.

VMware's new RESTful API has built-in documentation. From the vSphere GUI, select the triple ellipsis and select "Developer Center":

Unfortunately, VMware's new metadata source does not appear to function with this distribution. According to Canonical's changelog, cloud-init version 21.3+ is required to recognize the new datasource. I tested with OpenSUSE 15.4 (which ships with cloud-init 21.4) and received the following error:

# A new feature in cloud-init identified possible datasources for        #
# this system as:                                                        #
#   []                                                                   #
# However, the datasource used was: OVF                                  #
#                                                                        #
# In the future, cloud-init will only attempt to use datasources that    #
# are identified or specifically configured.                             #
# For more information see                                               #
#   https://bugs.launchpad.net/bugs/1669675                              #
#                                                                        #
# If you are seeing this message, please file a bug against              #
# cloud-init at                                                          #
#    https://bugs.launchpad.net/cloud-init/+filebug?field.tags=dsid      #
# Make sure to include the cloud provider your instance is               #
# running on.                                                            #
#                                                                        #
# After you have filed a bug, you can disable this warning by launching  #
# your instance with the cloud-config below, or putting that content     #
# into /etc/cloud/cloud.cfg.d/99-warnings.cfg                            #
#                                                                        #
# #cloud-config                                                          #
# warnings:                                                              #
#   dsid_missing_source: off                                             #
**************************************************************************

To view the provided and applied metadata for a system, cloud-init provides the following file handle:

/run/cloud-init/instance-data.json

To view the userdata for a system, use the following command:

cloud-init query userdata
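
Since instance-data.json is plain JSON, a couple of lines of Python can pull out whichever fields you care about - here the detected datasource and platform. Note that the exact key layout can shift between cloud-init releases:

import json

# Read cloud-init's runtime view of the instance. The "v1" keys are fairly
# stable, but the exact layout can vary between cloud-init releases.
with open("/run/cloud-init/instance-data.json") as handle:
    instance = json.load(handle)

print("Datasource:", instance.get("v1", {}).get("datasource"))
print("Platform:  ", instance.get("v1", {}).get("platform"))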

This indicates that we probably have an upstream issue with the new data source type. Reviewing the changelog, we see several fixes applied to this data source.

Applying Workload Templates

Note: This feature is only available on vSphere 7 and up!

Here's how to leverage the OVF data source with vSphere and OpenSUSE.

The flag disable_vmware_customization functions as a switch to choose between the metadata source and the OVF data source. Add the following to /etc/cloud/cloud.cfg:

disable_vmware_customization: false
datasource:
  OVF:
    allow_raw_data: true
vmware_cust_file_max_wait: 25

Once installed, shut the virtual machine down. Right-click on the VM, and select Clone -> Clone as Template to Library:

This vCenter feature will orchestrate the conversion to a template object and publish it to a Content Library as one step.

Deploying a customized machine

The next process needs to be executed via vCenter's Content Library vSphere API:

  • Establish API Session Key (required authentication for the endpoints used to deploy)
  • Deploy Content Library Object (/api/vcenter/vm-template/library-items/)
    • Find the correct content library
    • Find the correct content library item
    • Find the content library item via the vsphere API (ID to use in deployment command)
    • Find vSphere Cluster
    • Find vSphere Folder
    • Find vSphere Datastore
    • Deploy Content Library Item
  • Wait until deployment is complete, periodically checking to see if it's complete
    • Normally, an API will respond immediately that the command was successful, and subsequent calls would be required to validate readiness. Instead, vSphere's RESTful API responds with a 200 response only if and when the deployment is complete, which simplifies our code
  • Locate the Virtual Machine. The previous API call responds with a 200 OK, and Postman conveniently times the operation for you as well!
  • Apply Guest Customization
  • Start VM
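
For anyone who prefers code to Postman, the same flow looks roughly like this in Python with the requests library. The vCenter name, credentials, and object IDs are placeholders, and the deploy body is abbreviated - the Postman collection below carries the exact payloads, plus the customization and power-on steps omitted here:

import requests

VCENTER = "https://vcenter.engyak.co"  # placeholder

# 1. Establish an API session key (vSphere 7+ REST authentication)
token = requests.post(f"{VCENTER}/api/session",
                      auth=("administrator@vsphere.local", "password"),
                      verify=False).json()
headers = {"vmware-api-session-id": token}

# 2. Deploy the content library item. This call blocks until the deployment
#    finishes and returns the new VM's identifier (see the note above).
deploy_body = {
    "name": "opensuse-cloudinit-01",
    "placement": {"cluster": "domain-c1008", "folder": "group-v1010"},
}
deploy = requests.post(
    f"{VCENTER}/api/vcenter/vm-template/library-items/"
    "TEMPLATE-ITEM-ID?action=deploy",      # placeholder library item ID
    json=deploy_body, headers=headers, verify=False)
vm_id = deploy.json()
print("Deployed VM:", vm_id)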

To replicate this lab, the Postman Environment and Collection will be provided at the bottom of this post. Postman provides a powerful platform to educate engineers unfamiliar with a particular API by expanding the behaviors an HTTP client may have. Automated processes are typically very terse, and do not effectively explain each step and behavior. To import this collection and environment, download the files, and import them:

Postman Environments will stage variables for consumption by Collections.

I have sorted the Postman Collection based on the order of execution. The final customization step will return a 204 if successful, with an empty body. To verify that the configuration was correctly applied, browse to the individual VM in vCenter, and look under Monitor -> Events for an event of the type "Reconfigure VM". If you see the task on the correct VM, start it, and you will see the following:

Soon after, look at the virtual machine to review its customized attributes!

Debugging/Troubleshooting Tips

This process is slightly opaque, and a little confusing at first. Here are some key points for troubleshooting, and the methods to manage it:

  • The vSphere /vm/guest/customization URI will only respond with a 204 if working correctly.
    • If it returns a 400, the error will indicate what part of the JSON spec is having issues. Keep in mind that it may only give you the parent key - tools like JSONLint offer a method to quickly validate payloads as well
  • When locating resources, the Content Library and Templates are returned as a UUID with no description. GET the individual objects to match with names, or use the find API
  • All other resources (datastore, VM name) are listed with their MOB name, e.g. domain-c1008
  • Save the response from the deployment action, it has the VM ID when it finally completes
  • VM Customization can only be applied to a VM that is OFF, and doesn't customize until the VM starts.

Troubleshooting customization after boot can be done by viewing the metadata (/run/cloud-init/) or by reviewing logs at the following locations:

/var/log/
/var/log/vmware/imc
journalctl -xe

The classic "wipe and restart" method is also quite valuable:

cloud-init clean -l -s
systemctl restart cloud-init

Finally, after a host is successfully configured, I'd recommend disabling cloud-init to prevent further customization. This is just as easily achieved with an Ansible playbook:

systemctl disable cloud-init cloud-final cloud-init-local cloud-config

Code

Saturday, August 13, 2022

Identity theft has gotten out of hand. Here are basic ways to protect yourself.

It's not a matter of if you will be the victim of a breach, but when.

Wired is starting to track breaches by halves (as a general tech publication), and security vendors are moving to monthly reporting due to the volume.

It's 2022, and it seems everyone loves to over-share on social media. This may feel good but introduces substantial risks. Let's talk about cyber hygiene.

Information security is a frame of mind, so the most effective way to protect yourself is by being smart. ISC2 has started an institution - The Center for Cyber Safety and Education -  to provide further effective education on how to comprehensively protect yourself online.

Here are some brief tips on when you shouldn't disclose information online. Always ask "Can I dial this back? Do I need to provide this much information?"

  • Personally Identifiable Information (PII) can provide adversaries with methods to fake your identity
    • Birthdays: social media companies love to collect them, and they're used for ID verification everywhere. Facebook doesn't need your exact birth date, and storing it there increases your risk. Avoid storing your full birth date whenever feasible
    • Credit Card Numbers, Expiration Dates, CVVs
    • Any image of any ID card you own. Driver's License numbers are particularly popular.
    • Hometowns or birth locations are fun to socialize, but fit in this same category
    • Full Middle Name
    • "Mother's maiden name" and other names unlisted and typically used by financial institutions or security questions. Social media quizzes aggressively try to steal information like this!
    • Previous Employers
    • Home address / shipping address. These are typically used to validate credit card transactions, particularly large charges
  • Personal Health Information (PHI) is typically protected by HIPAA, with large exceptions for non-medical institutions. Don't share any of this information without full disclosure of how it will be used!
    • Medical history, surgeries, etc.
    • Ancestry information

It's worth reiterating: your children are much more likely to be targeted as well. Here are some guidelines on how to protect them from discovering they have a mortgage and a compromised credit score in junior high.

This is the most important part, but also the most difficult. We can use products or services to protect your identity and shore up any gaps.

Credit Locking / Credit Freezes

Now that we're done scaring you, the good news is that providing some basic level of protection against identity theft isn't particularly hard. Crime does pay, and the most effective way to terminate the pattern is to pursue every avenue to prevent new credit from being opened with your identity. Most banks, utilities, and other services won't open a credit account without a credit report, so the most effective method of countering compromise is to disallow any and all credit report attempts. The neat thing about this method is that it also stops people who are providing legitimate services to you from sneakily executing reports without your consent - reports that ding your credit score in the process.

If you don't do anything else I suggest, do this. It's going to take 5-10 minutes to do all three. Here are the links to "freeze credit" (prevent credit reports from being executed with your information):

Note: You'll need to create a new account for each of these services! Don't lose this information!

Use a Password Manager

To quote Mel Brooks: "12345! That's amazing! I have the same combination on my luggage!". Cryptography isn't magic, and all the transport security and firewalls in the world can't protect you from weak identity material. 

The most effective way (for the least effort) to de-risk yourself is to set up a password manager. We see some peripheral advantages outside of password storage like storing confidential documents, sharing passwords between family members, etc.

I'm not going to recommend a specific product here, because needs can vary quite a bit from person to person. Here are some typical requirements I keep in mind when evaluating a password manager:

  • How strict is its MFA? Can you disable SMS one-time codes? Is a hardware security token like a Yubikey supported?
  • Does it support a family plan?
  • What are its breach response plans?
  • How securely does it store your data?
  • Is it compatible with my devices?

Personally, I use 1Password for its Yubikey and family plan support. It gives me peace of mind, and it has a feature where all passwords are released to my family if I fail to log in for a month. Here are some others, in no particular order:

Using one is better than not - so all of these would be an improvement over nothing at all. I've used Dashlane and LastPass and dropped them in favor of 1Password.

Multi-Factor Authentication

Multi-factor authentication can be broken out into the following major categories:

  • Something you know: Passwords are an example of this "authentication factor". If a credential is publicly exposed (e.g. used on the Internet) it should be unique to that service to ensure that your banks don't get compromised if your Twitter password leaks
  • Something you have: The most common MFA tools fit in this category. Yubikeys are fantastic (if supported), and Time-based One-Time Password (TOTP) apps are good options. I don't personally have any strong preference other than AVOID SMS / TEXT MESSAGE MFA!
  • Something you are: Murky waters abound here, because you must be completely fine with submitting your biometrics to a third party. I'm not keen on doing this, given its potential for misuse. Most consumer fingerprint scanners are "passable" at best, so I don't consider this a good standalone authentication factor.
  • Somewhere you are: Location-based services are usually somewhat iffy for private (non-enterprise, non-government) use, as they aren't particularly accurate. If you're consuming a service like Gmail, the company should provide this for you.
  • Something you do: This is a real propeller-hat scientific factor. Capturing behavior patterns can reveal whether you're behaving normally. Again, this is mostly the responsibility of the group providing you a service.
    • There's a low-tech way to provide this authentication factor in the real world - paying a security guard. They're good at this and don't need a Ph.D to do it.

Identity Theft Protection

Now, it's time to bring out the heavy hitters. We don't always have the time to keep an eye on the entire internet, or to research recommendations to reduce our online footprint.

Leaning on the experts in identity theft protection services is the way to go. The industry is awash with good options, and the providers of these services aggressively drive costs down to make it affordable.

Full disclosure, I am employed by Allstate, who provides ID theft services. These recommendations are my own and not my employer's.

Here are some guidelines when evaluating ID Theft Protection services:

  • Do they have a family plan? Children's ID theft is on the rise, mostly because it's easy to predict SSNs given a birth location, easily available information like birth date and addresses, etc. You'd think creditors would avoid opening up a credit card in a newborn's name, but you'd be wrong. Add them to your ID theft protection, freeze their credit!
  • What services do they monitor? At a minimum, they should track your credit score without affecting it!
  • What insurance do they provide?
  • What guidance and periodic advice do they offer to customers and the public?
  • What recommendations do they make to improve your online presence?

I'd avoid the ones provided by the credit industries - the Equifax breach impacted my confidence, and nothing brought it back.

As an aside, if you've been a victim of any of the wave of breaches recently, you're probably eligible for free ID theft protection services from multiple companies. Use this to shop around, if you like one, stick with it. If you don't find any you like, here are some popular ones:

Shop around! The worst thing you can do with your online presence is nothing, and there's a wide variety of good products to help you out. These services typically provide a trial - use it to evaluate whether one is a good fit.

Conclusion

Society has passed the "age of innocence" with identity theft, and cybersecurity will need to become a routine for anyone living in it. Pandora's box has been opened, and criminals are not going to forget how easy and low-risk cybercrime is. Protecting yourself is a rabbit-hole where all effort is valuable - but you don't need to be a security expert to get the basics in place.
