add some commentary on fragmentation and other clearnet issues

Simon Marsh 2022-06-11 23:03:00 +01:00
parent 867d33a9f2
commit 6c153541ff
2 changed files with 59 additions and 12 deletions


@@ -81,6 +81,9 @@ Congratulations, you're connected to DN42 !
## Complex
- Isolate your network
- Using VRFs
- Using Namespaces
- Connect multiple nodes to the same peer AS in different geographic locations
- Optimise the routes to the AS
- Optimise the routes that the peer AS has to you


@@ -10,35 +10,79 @@ This page documents some key elements of the current burble.dn42 design.
{{<figure src="/design/DN42-Tunnels.svg" width="80%">}}
Hosts within the burble.dn42 network are joined using an IPsec/L2TP mesh.
Hosts within the burble.dn42 network are joined using a Wireguard/L2TP mesh.
Static, unmanaged, L2TP tunnels operate at the IP level and are configured
to create a full mesh between nodes. IPsec in transport mode protects the
L2TP protocol traffic.
to create a full mesh between nodes. Wireguard provides encryption and
encapsulates the L2TP traffic in plain UDP, which hides fragmentation from the
underlay and allows packets to be processed within intermediate routers' fast path.
Using L2TP allows for a large, virtual MTU of 4310 between nodes; this is
chosen to spread the encapsulation costs of higher layers across packets.
L2TP also allows for multiple tunnels between hosts and these are sometimes
used to separate low level traffic without incurring the additional overheads
L2TP also allows for multiple tunnels between hosts, which can be used
to separate low level traffic without incurring the additional overheads
of VXLANs (e.g. for NFS cross mounting).
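As a rough sketch of how one such node-to-node link could be built by hand with
iproute2 (the interface names, addresses, keys, IDs and the WireGuard MTU below
are illustrative, not the deployed configuration):

```sh
# Illustrative point-to-point WireGuard link to a remote node
ip link add wg-node2 type wireguard
wg set wg-node2 private-key /etc/wireguard/node1.key listen-port 51820 \
   peer <node2-public-key> endpoint node2.example.net:51820 \
   allowed-ips 10.255.0.2/32
ip addr add 10.255.0.1/31 dev wg-node2
ip link set wg-node2 up mtu 1420        # standard-sized UDP on the clearnet side

# Static, unmanaged L2TP tunnel and session carried inside the WireGuard link
ip l2tp add tunnel tunnel_id 12 peer_tunnel_id 21 encap udp \
   local 10.255.0.1 remote 10.255.0.2 udp_sport 1701 udp_dport 1701
ip l2tp add session tunnel_id 12 session_id 1 peer_session_id 1 name l2tp-node2
ip link set l2tp-node2 up mtu 4310      # large virtual MTU between nodes
```

In the real network a link like this exists for every pair of nodes, forming the
full mesh.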
The network also supports using point to point wireguard tunnels instead of the
IPsec/L2TP mesh. In this case, the large in-tunnel MTU requires UDP fragmentation
support between the hosts.
Network configuration on hosts is managed by systemd-networkd.
{{<hint info>}}
<b>Real Life Networks and Fragmentation.</b>
Earlier designs for the burble.dn42 network relied on passing fragmented packets
directly down to the clearnet layer (e.g. via IPsec ESP fragmentation, or
UDP fragmentation with wireguard). In practice it was observed that
clearnet ISPs could struggle with uncommon packet types, with packet
loss seen particularly in the
[IPv6 case](https://blog.apnic.net/2021/04/23/ipv6-fragmentation-loss-in-2021/).
It seems likely that some providers' anti-DDoS and load-balancing platforms
played a particular role in magnifying this problem.
To resolve this, the network was re-designed to ensure fragmentation took
place at the L2TP layer, so that all traffic gets encapsulated into
standard-sized UDP packets. This design ensures all traffic looks 'normal' and can
remain within intermediate routers'
[fast path](https://en.wikipedia.org/wiki/Fast_path).
{{</hint>}}
{{<hint info>}}
<b>ISP Rate Limiting</b>
The burble.dn42 network uses jumbo-sized packets that are fragmented by
L2TP before being encapsulated by wireguard. This means a single packet in
the overlay layers can generate multiple wireguard UDP packets in quick
succession, appearing as a high-bandwidth burst of traffic on the
outgoing clearnet interface. It is vital that all of these packets arrive
at the destination, or the entire overlay packet will be lost.
For most networks this is not a problem and generally the approach
works very well.
However, if you have bandwidth limits with your ISP (e.g. a 100Mbit bandwidth
allowance provided on a 1Gbit port), packets may be generated at a high bit
rate and then decimated by the ISP to match the bandwidth allowance.
This would normally be fine, but when a fragmented packet is sent, the
burst of smaller packets is highly likely to exceed the bandwidth
allowance, and the impact on upper-layer traffic is brutal, causing
nearly all packets to be dropped.
The burble.dn42 network manages this issue by shaping outgoing traffic
with Linux tc (via
[FireQOS](https://firehol.org/tutorial/fireqos-new-user/)). This allows
outgoing packets to be queued at the correct rate, rather than being
arbitrarily decimated by the ISP.
{{</hint>}}
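A minimal sketch of that kind of FireQOS policy, assuming an illustrative
100Mbit allowance on `eth0` and the default WireGuard port; the actual
burble.dn42 shaping rules are more involved:

```
# /etc/firehol/fireqos.conf (illustrative values only)
interface eth0 wan-out output rate 95Mbit   # shape slightly under the ISP allowance
    class tunnel commit 80Mbit
        match udp port 51820                # WireGuard packets carrying the mesh
```

Pacing outgoing packets at the allowed rate means a burst of fragments is
queued locally rather than being dropped part-way through by the ISP.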
Network configuration on hosts is managed by systemd-networkd.
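For example, the WireGuard side of a mesh link might be described declaratively
with a `.netdev` unit along these lines (names, keys and values are again made
up for illustration):

```
# /etc/systemd/network/wg-node2.netdev (illustrative)
[NetDev]
Name=wg-node2
Kind=wireguard
MTUBytes=1420

[WireGuard]
PrivateKeyFile=/etc/systemd/network/wg-node2.key
ListenPort=51820

[WireGuardPeer]
PublicKey=<node2-public-key>
Endpoint=node2.example.net:51820
AllowedIPs=10.255.0.2/32
```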
## BGP EVPN
![EVPN diagram](/design/DN42-EVPN.svg)
Overlaying the IPsec/L2TP mesh is a set of VXLANs managed by a BGP EVPN.
Overlaying the Wireguard/L2TP mesh is a set of VXLANs managed by a BGP EVPN.
The VXLANs are primarily designed to tag and isolate transit traffic, making
their use similar to MPLS.
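One common way to realise an EVPN-managed VXLAN on Linux (a sketch, not
necessarily the exact burble.dn42 setup; the VNI, names and loopback address
are illustrative) is a learning-disabled vxlan device enslaved to a bridge,
with the EVPN control plane populating the forwarding entries:

```sh
# Illustrative VXLAN segment for transit traffic; EVPN (FRR) fills the FDB
ip link add vx-transit type vxlan id 4242 local 10.255.255.1 \
   dstport 4789 nolearning
ip link add br-transit type bridge
ip link set vx-transit master br-transit up
ip link set br-transit up
```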
The Babel routing protocol is used to discover loopback addresses between nodes;
Babel is configured to operate across the point to point L2TP tunnels and with a static,
latency based metric that is applied during deployment.
Babel is configured to operate across the point-to-point L2TP tunnels, with a
static, latency-based metric that is applied during deployment.
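As an illustration of the idea, using FRR's babeld syntax (the interface name
and rxcost value are made up, and the network may equally use the standalone
babeld daemon; the deployed metric is derived from measured latency):

```
! Illustrative babel configuration for one mesh interface
interface l2tp-node2
 babel wired
 babel rxcost 96        ! static cost set from measured latency at deploy time
!
router babel
 network l2tp-node2
```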
The BGP EVPN uses [FRR](https://frrouting.org/) with two global route reflectors
located on different continents, for redundancy. Once overheads are taken into account