---
title: "Network Design"
geekdocDescription: "burble.dn42 network design"
weight: 0
---
{{<hint warning>}}
This page documents a previous iteration of the burble.dn42 network and
is currently out of date.
{{</hint>}}
## Tunnel Mesh
{{<figure src="/design/DN42-Tunnels.svg" width="80%">}}

Hosts within the burble.dn42 network are joined using a WireGuard/L2TP mesh.
Static, unmanaged L2TP tunnels operate at the IP level and are configured
to create a full mesh between nodes. WireGuard provides encryption and
encapsulates the L2TP traffic in plain UDP, hiding fragmentation from the
clearnet layer and allowing packets to be processed within intermediate
routers' fast path.

Using L2TP allows for a large, virtual MTU of 4310 between nodes; this is
chosen to spread the encapsulation costs of the higher layers across larger
packets. L2TP also allows for multiple tunnels between hosts, which can be
used to separate low-level traffic without incurring the additional overhead
of VXLANs (e.g. for NFS cross-mounting).

Network configuration on hosts is managed by systemd-networkd and applied
with Ansible.
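
As an illustration only, a single mesh link could be described with
systemd-networkd units along the following lines. The names, keys, addresses
and IDs here are placeholders rather than the real configuration, and the
4310 MTU and tunnel addressing would be applied in the matching `.network`
files.

```ini
# wg-host2.netdev -- hypothetical WireGuard carrier for one mesh link
[NetDev]
Name=wg-host2
Kind=wireguard

[WireGuard]
PrivateKeyFile=/etc/systemd/network/wg-host2.key
ListenPort=51821

[WireGuardPeer]
PublicKey=BASE64_PEER_PUBLIC_KEY
Endpoint=host2.example.net:51821
AllowedIPs=0.0.0.0/0, ::/0

# l2tp-host2.netdev -- hypothetical static L2TP tunnel run over the link above
# (UDP port options omitted for brevity)
[NetDev]
Name=l2tp-host2
Kind=l2tp

[L2TP]
TunnelId=2
PeerTunnelId=1
Local=169.254.0.1
Remote=169.254.0.2
EncapsulationType=udp

[L2TPSession]
Name=l2s-host2
SessionId=1
PeerSessionId=1
```
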
{{<hint info>}}
<b>Real Life Networks and Fragmentation.</b>
Earlier designs for the burble.dn42 network relied on passing fragmented packets
directly down to the clearnet layer (e.g. via ESP IPsec fragmentation, or
UDP fragmentation with WireGuard). In practice it was observed that
clearnet ISPs could struggle with uncommon packet types, with packet
loss seen particularly in the
[IPv6 case](https://blog.apnic.net/2021/04/23/ipv6-fragmentation-loss-in-2021/).
It seems likely that some providers' anti-DDoS and load balancing platforms
played a particular role in magnifying this problem.

To resolve this, the network was re-designed to ensure fragmentation takes
place at the L2TP layer, so that all traffic is encapsulated into standard-sized
UDP packets. This design ensures all traffic looks 'normal' and can
remain within intermediate routers'
[fast path](https://en.wikipedia.org/wiki/Fast_path).
{{</hint>}}
{{<hint info>}}
<b>ISP Rate Limiting</b>
The burble.dn42 network uses jumbo-sized packets that are fragmented by
L2TP before being encapsulated by WireGuard. This means a single packet in
the overlay layers can generate multiple WireGuard UDP packets in quick
succession, appearing as a high-bandwidth burst of traffic on the
outgoing clearnet interface. It's vital that all of these packets arrive
at the destination, or the entire overlay packet will be lost.
For most networks this is not a problem and generally the approach
works very well.

However, if you have bandwidth limits with your ISP (e.g. a 100Mbit bandwidth
allowance provided on a 1Gbit port) packets may be generated at a high bit
rate and then decimated by the ISP to match the bandwidth allowance.
This would normally be fine, but when a fragmented packet is sent, the
burst of smaller packets is highly likely to exceed the bandwidth
allowance and the impact on upper-layer traffic is brutal, causing
nearly all packets to be dropped.

The burble.dn42 network manages this issue by implementing traffic shaping
on the outgoing traffic using Linux tc (via
[FireQOS](https://firehol.org/tutorial/fireqos-new-user/)). This allows
outgoing packets to be queued at the correct rate, rather than being
arbitrarily decimated by the ISP.
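
As a rough illustration of the idea (the real configuration is generated by
FireQOS rather than written by hand, and the interface name and rates below
are placeholders), the resulting shaping is conceptually similar to:

```sh
# hypothetical sketch: shape clearnet egress slightly below the ISP allowance so
# that bursts of WireGuard fragments are queued on the host rather than dropped
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 95mbit ceil 95mbit
tc qdisc add dev eth0 parent 1:10 handle 10: fq_codel
```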
{{</hint>}}
## BGP EVPN
![EVPN diagram](/design/DN42-EVPN.svg)

Overlaying the WireGuard/L2TP mesh is a set of VXLANs managed by a BGP EVPN.
The VXLANs are primarily designed to tag and isolate transit traffic, making
their use similar to MPLS.

The Babel routing protocol is used to discover loopback addresses between nodes;
Babel is configured to operate across the point-to-point L2TP tunnels and with a
static, latency-based metric that is applied during deployment.

The BGP EVPN uses [FRR](https://frrouting.org/) with two global route reflectors
located on different continents for redundancy. Once overheads are taken into
account, the MTU within each VXLAN is 4260.
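
The sketch below shows the rough shape of this configuration in FRR; the ASN,
interface name, cost and route reflector address are placeholders rather than
the live values.

```
! hypothetical FRR sketch: Babel across one L2TP mesh link, plus an iBGP
! EVPN session to a route reflector loopback
interface l2tp-host2
 ! static, latency-derived metric applied at deployment time
 babel rxcost 128
!
router babel
 network l2tp-host2
!
router bgp 64512
 neighbor 10.255.0.1 remote-as 64512
 neighbor 10.255.0.1 update-source lo
 !
 address-family l2vpn evpn
  neighbor 10.255.0.1 activate
  advertise-all-vni
 exit-address-family
```
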
## dn42 Core Routing
![Core routing diagram](/design/DN42-Core.svg)

Each host in the network runs an unprivileged LXD container that acts as a dn42 router
for that host. The container uses [Bird2](https://bird.network.cz/) and routes between
dn42 peer tunnels, local services on the same node and transit to the rest of the
burble.dn42 network via a single dn42 core VXLAN.

Local services and peer networks are fully dual-stack IPv4/IPv6; however, the
transit VXLAN uses only IPv6 link-local addressing, relying on the BGP
multiprotocol and extended next hop capabilities to carry IPv4.
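
In Bird2 terms the transit sessions are therefore plain link-local iBGP with
the extended next hop capability enabled on the IPv4 channel; a minimal sketch
(the ASN, interface name and address are placeholders) looks something like:

```
# hypothetical Bird2 sketch: IPv6 link-local transit session that also
# carries IPv4 routes via the extended next hop capability
protocol bgp core_transit {
    local as 64512;
    neighbor fe80::1%'dn42core' as 64512;

    ipv4 {
        extended next hop on;
        import all;
        export all;
    };
    ipv6 {
        import all;
        export all;
    };
}
```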

The transit VXLAN and burble.dn42 services networks use an MTU of 4260;
however, the dn42 BGP configuration includes internal communities that
distribute the destination MTU across the network, giving per-route MTUs.
This helps ensure path MTU discovery takes place as early and efficiently
as possible.
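
The community scheme itself is internal to burble.dn42 and not documented here,
but the general idea can be sketched as a Bird2 filter; the `(64512, 1, mtu)`
large community format below is purely an assumed example.

```
# hypothetical sketch only: tag exported routes with an assumed large community
# carrying the destination MTU, so remote routers can learn per-route MTUs
function add_mtu_community(int mtu) {
    bgp_large_community.add((64512, 1, mtu));
}

filter export_core {
    add_mtu_community(4260);
    accept;
}
```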

Local services on each host are provided by
[LXD](https://linuxcontainers.org/lxd/introduction/) containers or VMs
connecting to internal network bridges.
These vary across hosts but typically include:

- **tier1** - used for publicly available services (DNS, web proxy, etc)
- **tier2** - used for internal services, with access restricted to burble.dn42 networks

Other networks might include:

- **dmz** - used for hosting untrusted services (e.g. the shell servers)
- **dn42 services** - for other networks, such as the registry services
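
Each of these networks is backed by an internal bridge on the host; as a
minimal sketch, such a bridge could be defined as a systemd-networkd netdev
(whether a particular bridge is created this way or managed by LXD is not
covered here):

```ini
# tier1.netdev -- hypothetical definition of the tier1 services bridge
[NetDev]
Name=tier1
Kind=bridge
```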

dn42 peer tunnels are created directly on the host and then injected into the
container using a small script, allowing the router container itself to remain
unprivileged.
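
The injection boils down to moving the host-created interface into the
container's network namespace. A simplified sketch of the idea is shown below;
the container and interface names are placeholders, and the real script also
deals with addressing and restarts.

```sh
#!/bin/sh
# hypothetical sketch: hand a host-created dn42 peer tunnel to the router container
PEER_IF="dn42-peer1"
ROUTER="dn42-router"

# find the init PID of the router container and move the interface into its netns
ROUTER_PID=$(lxc info "$ROUTER" | awk '/^PID:/ {print $2}')
ip link set dev "$PEER_IF" netns "$ROUTER_PID"

# bring the interface back up from inside the container
lxc exec "$ROUTER" -- ip link set dev "$PEER_IF" up
```
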
The routers also run nftables for managing access to each of the networks,
[bird_exporter](https://github.com/czerwonk/bird_exporter) for metrics and the
[bird-lg-go](https://github.com/xddxdd/bird-lg-go) proxy for the
burble.dn42 [looking glass](https://lg.burble.com).
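
As an example, restricting an internal, tier2-style network to dn42 sources
could be expressed in nftables roughly as follows; this is a minimal sketch
using the main dn42 address ranges, and the real rulesets are more involved.

```
# hypothetical nftables sketch: only dn42 sources may reach the internal network
table inet services {
    chain forward {
        type filter hook forward priority 0; policy drop;

        iifname "tier2" accept
        oifname "tier2" ip saddr 172.20.0.0/14 accept
        oifname "tier2" ip6 saddr fd00::/8 accept
    }
}
```
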
## Host Configuration
![Host diagram](/design/DN42-Host.svg)

burble.dn42 nodes are designed to have minimal functionality at the host level,
with all major services being delivered via virtual networks, containers and VMs.

Hosts have three main functions:

- connecting to the burble.dn42 WireGuard/L2TP mesh and BGP EVPN
- providing internal bridges for virtual networks
- hosting [LXD](https://linuxcontainers.org/lxd/introduction/) containers and VMs

Together these three capabilities allow arbitrary, isolated networks and services
to be created and hosted within the network.

The hosts also provide a few ancillary services:

- delivering clearnet access for internal containers/VMs using an internal bridge.
  The host manages addressing and routing for the bridge to allow clearnet access
  independent of the host's capabilities (e.g. proxied vs routed IPv6 connectivity)
- creating dn42 peer tunnels and injecting them into the dn42 router container
- monitoring via [netdata](https://www.netdata.cloud/)
- backup using [borg](https://borgbackup.readthedocs.io/en/stable/)