Internet Load Balancing with pfSense

Oct 8, 2023 · 7 min read · Design Patterns · Home Lab · Telework

With full-time remote work, internet outages transform from a nuisance to a real problem

Prior to the pandemic, "working hours" were typically considered fair game by internet service providers to schedule necessary system maintenance. It's unrealistic to expect perfect uptime from any service provider - as the saying goes:

Schedule maintenance on your equipment before your equipment schedules it for you!

ISPs are terrible about this, mostly because "old and stable" means customers receive reliable service. Eventually that trusty Toyota Corolla dies, though, causing severe customer impact.

I'd suggest taking matters into your own hands here. The technologies involved in internet load balancing are fairly complex, but if you follow a known formula it's doable for most tech-savvy users.

Internet Load Balancing

Load balancing network traffic is traditionally a separate domain from routing and firewalling, with most of the general industry focus centering around Server Load Balancing (SLB). An Internet Load Balancer (IPv4) needs to provide the following functions reliably:

Monitor each available path for viability with some form of end-to-end test.
Evenly (or with a ratio) balance new flows between each available path.
Track related sessions and pin them to a specific path ("affinity"), ensuring that paired protocols like RTP + RTCP work.
NAT outbound traffic to the address of the link it leaves on (IPv4).
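To make the first three functions concrete, here is a minimal Python sketch of the selection logic: skip unhealthy paths, reuse a client's previous path when one exists, and otherwise pick by weight. It is not how pfSense implements this internally. The `Gateway` and `WanBalancer` names, the per-client affinity table, and the gateway names are all hypothetical, and NAT is left out entirely.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Gateway:
    name: str
    weight: int = 1       # relative share of new flows
    healthy: bool = True  # outcome of the end-to-end path monitor


@dataclass
class WanBalancer:
    gateways: list
    affinity: dict = field(default_factory=dict)  # client IP -> gateway name
    rng: random.Random = field(default_factory=random.Random)

    def pick(self, client_ip: str) -> str:
        """Return the name of the gateway a new flow from client_ip should use."""
        healthy = {g.name: g for g in self.gateways if g.healthy}
        if not healthy:
            raise RuntimeError("no healthy gateways available")

        # Affinity: keep using the previous path while it is still up.
        pinned = self.affinity.get(client_ip)
        if pinned in healthy:
            return pinned

        # Otherwise choose among the healthy paths in proportion to weight.
        names = list(healthy)
        weights = [healthy[n].weight for n in names]
        choice = self.rng.choices(names, weights=weights, k=1)[0]
        self.affinity[client_ip] = choice
        return choice


# Hypothetical 2:1 split between a wireline and a wireless link.
lb = WanBalancer([Gateway("WAN_WIRELINE", 2), Gateway("WAN_WIRELESS", 1)])
path = lb.pick("192.168.1.10")
```

Pinning per client rather than per flow is the simplest way to keep related sessions together, at the cost of less even balancing when only a few clients generate most of the traffic.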

To clarify, this doesn't cover SD-WAN and why it's more effective. Per-packet assessment and FEC lead to a much higher quality user experience and can achieve much cleaner ratios than what I provide below, but home users typically have high individual bandwidth with their internet services and like the concept of using them to their fullest. If the connectivity options at home are sufficiently mismatched or slow, it would be worthwhile to take SD-WAN solutions into consideration.

Let's establish an example topology, and cover the tunables that will provide a "good enough" WAN load balancing solution that centers around minimizing impact to remote work:

In this scenario, we'll just assume that one link is wireline and the other isn't to make things easy to explain. The transport doesn't really matter much, but the distinction simplifies any documentation from here on out.

First, let's assign a new interface for the second internet link, and configure it for DHCP. This menu can be found under Interfaces → Assignments:

Note: Ensure that "Block private networks and loopback addresses" and "Block bogon networks" are checked. This is a WAN link, after all.

When using DHCP, the secondary WAN link should automatically install a "gateway", but it won't load balance just yet. We need to create a Gateway Group to enforce load balancing policies, and then assign it as the default gateway for things to take effect:

Now, let's create monitoring IPs so pfSense can periodically test for packet loss or latency on that link. The following menu is available by editing the service provider gateway under System → Routing → Gateways:

I'd suggest using the service provider's own DNS servers, or an anycast DNS provider you don't typically use, as the monitor addresses. pfSense installs a static route for each monitor address via its WAN, so traffic to that address always leaves through that link and fails whenever the link goes down.

Note: Reuse this IP and create a DNS monitor with Uptime Kuma if you want to monitor per-provider reliability. It's quick and easy!
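The decision a gateway monitor makes boils down to comparing recent loss and latency against two sets of thresholds. The Python sketch below illustrates that logic only; it is not pfSense's monitoring code. The function name and the threshold values are placeholders, so use whatever is configured under the gateway's Advanced settings.

```python
from statistics import mean
from typing import Optional, Sequence


def gateway_status(
    rtts_ms: Sequence[Optional[float]],
    loss_warn: float = 10.0,      # percent, placeholder value
    loss_down: float = 20.0,      # percent, placeholder value
    latency_warn: float = 200.0,  # milliseconds, placeholder value
    latency_down: float = 500.0,  # milliseconds, placeholder value
) -> str:
    """Classify a path from recent probes; None means the probe got no reply."""
    if not rtts_ms:
        raise ValueError("need at least one probe result")

    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(replies)) / len(rtts_ms)
    # With no replies at all, treat latency as infinite so the path is "down".
    avg_rtt = mean(replies) if replies else float("inf")

    if loss_pct >= loss_down or avg_rtt >= latency_down:
        return "down"
    if loss_pct >= loss_warn or avg_rtt >= latency_warn:
        return "warning"
    return "online"


# Ten probes to a monitor address, one of them lost.
status = gateway_status([18.0, 21.5, None, 19.2, 20.1, 22.0, 18.7, 19.9, 20.4, 21.0])
```

The gap between the warning and down thresholds is what keeps a briefly congested link from flapping in and out of the gateway group.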

This is all that's required, assuming that you want to get the most even load balancing possible. Here are a few tunables that may apply to more specific scenarios:

pfSense won't account for asymmetric link speeds by default. If the interface speeds are different, you will need to create a policy-based routing rule (Firewall → Rules → LAN → New rule), and set Gateway under the rule's Advanced Options:
While editing the gateway (System → Routing → Gateways), look for an Advanced setting labeled Weight. This will allow you to set a ratio between gateways in the same group, e.g. 2:1.
pfSense provides a simplified persistence mechanism ("Use sticky connections," under System → Advanced → Miscellaneous) that will pin each client to a specific WAN link. This is important, particularly if your remote work situation requires heavy use of voice and video services like Zoom or Teams. Please note that this feature will make load balancing considerably less even!
pfSense provides gateway status under Status → Gateways, but I haven't found a way to externally track those statistics via SNMP.
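If you're unsure what Weight values to enter for mismatched links, one reasonable starting point is to reduce the link speeds to their smallest whole-number ratio. The Python sketch below does exactly that. The helper name, the gateway names, and the assumed maximum weight of 30 are illustrative assumptions, so check the range your pfSense version offers.

```python
from math import gcd

MAX_WEIGHT = 30  # assumption: the largest value the Weight setting accepts


def weights_for_speeds(speeds_mbps: dict) -> dict:
    """Reduce whole-number link speeds (Mbit/s) to a small weight ratio."""
    if not speeds_mbps:
        raise ValueError("need at least one link")
    divisor = 0
    for speed in speeds_mbps.values():
        if speed <= 0:
            raise ValueError("link speeds must be positive integers")
        divisor = gcd(divisor, speed)

    weights = {name: speed // divisor for name, speed in speeds_mbps.items()}

    # Scale down (with rounding) if the exact ratio exceeds the allowed range.
    largest = max(weights.values())
    if largest > MAX_WEIGHT:
        scale = MAX_WEIGHT / largest
        weights = {name: max(1, round(w * scale)) for name, w in weights.items()}
    return weights


print(weights_for_speeds({"WAN_WIRELINE": 1000, "WAN_WIRELESS": 500}))
# prints {'WAN_WIRELINE': 2, 'WAN_WIRELESS': 1}
```

Advertised speeds are only a first guess. If one link routinely underdelivers, weight it by what you actually measure.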

The Internet and its scaling issues

We've created a problem with global routing that is just plain fascinating.

Network Address Translation allows us to "spoof" our internal private networks with multiple public prefixes. This both solves and creates problems - as an upside, we're able to leverage WAN redundancy with service-provider public IPv4 addressing somewhat easily. This matters, because the public internet routing table currently can't support a designated prefix for every home network, and we're already experiencing internet availability issues due to route propagation:

In August 2014 we celebrated "512k Day" by enjoying a number of network outages related to TCAM capacity worldwide.
The Internet Service Providers (ISPs) at the time managed to postpone this issue by "carving" TCAM, re-allocating ternary memory from other purposes. This provided capacity for roughly 256,000 additional routes, but the clock was still ticking. It bought about 5 years, which was generally enough time to bump up capacity and lifecycle hardware.

Now, we have a new problem.

IPv4 routes consume 64 bits (8 bytes) of memory each, assuming that no hardware optimization is used. (You could store the prefix length, 1-32, as a 5-bit integer, but route lookups would then become a multi-pass operation or require a lookup table.) That results in an internet routing table size of 4 Megabytes on 512k day, or 6 Megabytes on 768k day. It doesn't sound like much, but TCAM is designed for fast lookup and is somewhat limited in capacity.

IPv6 requires 256 bits (32 bytes) of storage per prefix, but summarizes more cleanly. Apples-to-apples at a million routes would be 8 Megabytes (IPv4) + 32 Megabytes (IPv6), or 40 MB of TCAM.
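The arithmetic behind those figures is easy to reproduce. The short Python sketch below uses the per-route sizes quoted above and assumes binary units (1k = 1024 routes, 1 Megabyte = 1024 × 1024 bytes), which is what makes the numbers come out round. The constant and function names are only illustrative.

```python
IPV4_ENTRY_BYTES = 8    # 64 bits per route, as quoted above
IPV6_ENTRY_BYTES = 32   # 256 bits per prefix, as quoted above
MIB = 1024 * 1024       # assumption: binary megabytes


def table_mib(routes: int, entry_bytes: int) -> float:
    """Memory needed to hold `routes` entries of `entry_bytes` each, in MiB."""
    return routes * entry_bytes / MIB


print(table_mib(512 * 1024, IPV4_ENTRY_BYTES))   # 4.0 on 512k day
print(table_mib(768 * 1024, IPV4_ENTRY_BYTES))   # 6.0 on 768k day

million = 1024 * 1024
print(table_mib(million, IPV4_ENTRY_BYTES) + table_mib(million, IPV6_ENTRY_BYTES))  # 40.0
```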

Most of the absolute latest networking hardware is up to the task, but only with decades of hacks and best-practice engineering optimizing it. If I, as a household, establish my own /64, it's not that much of a problem, but every other network doing so would result in a table orders of magnitude larger than hardware today can handle. Forcing aggregation instead generally violates the design principle that "prefix summarization is hiding useful information," but it's driven by hardware limitations (as it always has been).

The Future (IPv6) Solution

Interestingly enough, IPv6 is well-suited to this problem, and the approach is simple. Endpoints typically have tons of compute resources available for simple tasks like internet load balancing, but the client software isn't quite up to snuff. A dynamic IPv6 network leverages Router Advertisements and DHCPv6 to configure host devices with DNS and IP addresses, and there is nothing restricting multiple routers from advertising multiple prefixes over the same network:

IPv6 With Multiple Router Advertisements

Well, nothing except our own internal limitations, and client software. This would require a client device to automatically test each "path" and decide which one to use for a given application. We're not quite there yet, but the key elements are in place to guarantee a much higher service quality than our core and home routers can execute.

Retrospectives

While researching this topic, I discovered a few things that might be good for budget-conscious or hands-on users:

You don't need the biggest internet plan from each service provider. While two 500 Megabit plans are definitely not going to be equal to a gigabit service, a typical household only uses a few megabits at a time. Right-sizing your internet services will save some serious cash, and may be cheaper than the single provider plan!
Capped services can be reduced to a lower ratio, or shut off entirely when the cap is reached. This approach is particularly appealing if services in your area are capped, because bandwidth caps are a tiny fraction of your link speed, and pfSense will average out your ratio rather effectively with more diverse usage.
Purchase an appliance with at least four ethernet ports! If a second service provider makes sense, it's entirely possible that a third may become an option.
If your ISP provides notice of maintenance, it's trivial to disable a gateway temporarily (System → Routing → Gateways → Edit):
Site-to-Site VPNs will need to be pinned to a specific WAN link via static routes, or by using dynamic tunnel IDs (not IP address identities!)
- Transport within a service provider will typically have much higher available bandwidth and lower latency than transport crossing multiple ISPs. If a site is important, try to match the service providers on both sides and run a tunnel per service provider for best results.
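To put the earlier point about capped services into numbers, the Python sketch below estimates how quickly a link running flat out would burn through a monthly cap, and what sustained rate that cap actually permits. The 1000 GB cap and 1000 Mbit/s link are hypothetical figures rather than any provider's actual plan, and decimal units are assumed (1 GB = 8000 megabits).

```python
def hours_to_cap(cap_gb: float, link_mbps: float) -> float:
    """Hours of continuous full-rate transfer needed to reach the cap."""
    cap_megabits = cap_gb * 1000 * 8
    return cap_megabits / link_mbps / 3600


def sustained_mbps(cap_gb: float, days: int = 30) -> float:
    """Average rate that would use exactly the cap over `days` days."""
    cap_megabits = cap_gb * 1000 * 8
    return cap_megabits / (days * 86400)


# Hypothetical plan: 1000 GB monthly cap on a 1000 Mbit/s link.
print(round(hours_to_cap(1000, 1000), 1))   # a little over 2 hours
print(round(sustained_mbps(1000), 1))       # about 3 Mbit/s averaged over the month
```

Under those assumptions, shifting even a third of the traffic to an uncapped link buys noticeable headroom before the capped service needs to be weighted down or disabled.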