I haven’t been in this sort of industry for awhile now, so I might be a bit of out of touch but I imagine this hasn’t changed since I left.
I saw a fedi post recently that talked about how corporate wouldn’t let them purchase a little switch for their office to make file transfers quicker. I won’t link it here because what I’m going to dive into isn’t the point of that post but I do have experience with why corporate don’t want you to plug in that 5 port Netgear switch that everyone buys1.
BTW, IT is likely doing their best and balancing reliability, cost and supportability, along with a dozen other user issues. Regardless your company should help you solve the problems you have with infrastructure rather than just saying no.
Today we’ll be focusing specifically on the layer 2 (ethernet) technical details. We won’t be talking about the security, privacy, safety, or physical reasons. There’s a lot of legacy reasons leading to cause of this, so before we dive in we need to understand some context as to why a corporate network have separate network segments and how this impacts the fault conditions we’ll discuss.
If you work in a big building like mine, or a large campus you could hundreds to thousands devices connected to the network. Devices broadcast a lot of traffic. This traffic goes to all devices. This includes working out what IP address should assigned, what hardware address maps to what IP address, and things performing network discovery so that icons pop up to indicate there’s a printer or streaming device available. This traffic is all sent to every device on the same segment. Scaling this up to thousands of devices would cause a lot of wasted bandwidth.
Even if we have enough bandwidth to handle all the screaming from devices, we still want to make sure our system is reliable. If some of the incidents I’m going to describe below happen, we want it to impact a smaller set of users, rather than the whole network.
Ok. Lets start building our network. One switch can’t handle all our users, so lets go down to our local office supply store and pick up the cheapest network switch we can find.
Perfect. Then someone accidentally kicks the switch with their foot, stuff gets unplugged and there’s a rush to plug everything back in. We then end up with this.
The two switches end up plugged into each other. If you bought switches with spanning tree protocol (STP), then this is fine and will work2. If you didn’t, we end up with a loop.
What happens is the ethernet frames get sent back and forwards forever building up until there’s no bandwidth left for any legitimate data. This is why we have spanning tree protocol.
Problem solved! Just buy switches with spanning tree…. except… it’s a little more difficult.
Consider this example. We add our switch into this complex network.
We plug in our pirate network switch. And suddenly the whole network stops for a minute or two. What just happened is that spanning tree had to reconfigure itself. This is because your switch happened to be configured with a lower priority than others and became the root switch.
You’ll notice that not only did we get hit with spanning tree reconfiguring, but also our pirate switch has forced the algorithm to select slower links than we would otherwise have available.
To make matters worse, there are multiple different types of spanning tree: STP, R(apid)STP, M(ulti)ST, P(er)V(lan)ST.
Lets go back to a dumb switch, that seems easier. What if we were to install it like above, but someone accidentally connects the dumb switch to two different switches on the network at the same time.
What happens here is that the spanning tree enabled switches can see a link to another spanning tree via the dumb switch. They aren’t aware of the dumb switch. This can cause issues like a large amount of traffic going via your tiny switch.
Ok. Maybe we are careful and we don’t connect another switch to ours. Someone finds a loose cable and accidentally plugs the switch into itself.
Of course we have a loop. However since spanning tree thinks everything is ok, that traffic is also transmitted to the rest of the network. From the point of view of spanning tree, there isn’t a network switch there. Even if the wired network can handle the bandwidth, the WiFi access points might not be able to.
Another little quirk is that many switches are configured with a system called “port fast”. Usually spanning tree waits a period of time to figure out if there is a network switch on the other end. Port fast assumes the port is meant for a device and skips that learning/listening phase. This means that loops can exist for some time before a loop is detected. Port fast exists so that computers don’t have to wait forever to get a DHCP lease to get going.
To summarise all of this
- All switches need to be spanning tree enabled for spanning tree to be effective
- All switches need to be configured correctly so that the suitable paths are selected
- For a stable network, switches need to be configured to prevent pirate switches
Preventing pirate switches
A number of configuration options exist to to prevent issues when switches are connected:
- BPDU Guard : BPDUs are the messages sent by spanning tree. If a switch detects this on a port that has been designated as a user port it will disable the network port and requires manual reset.
- Root guard : This flags which ports on the switch we expect to find the spanning tree root. We disable ports that would have resulted in a root we didn’t expect
- Loop guard : Detects packet loops and disables the port
- Setting low spanning tree priorities
- Mac address limit : We can detect dumb switches by counting how many devices a switch port can see
Bonus 0: VLANs
How your network is configured to handle spanning tree and VLANs could be one of many many many configurations. The network might have VLANs have cover only some switches for some VLANs. Spanning tree could be running per VLAN, or a group of VLANs. This means connecting a spanning tree switch might only impact one spanning tree instance leaving loops possible in other VLANs.
Bonus 1: Unidirectional Link Detection
Unidirectional what? We like to think of network links as working or not. But there’s a secret third option - working only in a single direction. This is especially common with fibre optics and media converters.
From spanning trees point of view, it can’t see a switch on the other side and will start forwarding packets towards it, thus causing a loop. We use UDLD (Unidirectional Link Detection) to prevent this.
Bonus 2: Virtual machines
Virtual machine systems, especially complex ones, can introduce their own switching and bridging to the equation which can cause loops when trying to configure redundant links or port aggregation. They also pose other possible threats such as duplicate virtual mac addresses. Typically these will trigger the mac address limits on ports.
Bonus 999999: what about TRILL? SPB?
Network vendors don’t have your best interests in mind and decided to fuck the standards for their own vendor lock in needs.
Bonus 1000000: UniFi have a web interface to make configuring this stuff simple
So did Cisco in like the 90s. Much like UniFi it also sucked at enterprise scale.
Other reasons
So while we discussed just one technical aspect as to why just yeeting random switches into a network is a bad idea, there’s many more.
- On going maintenance - firmware updates/patching
- Security - if its a managed switch (common for STP support) then ensuring its configured securely
- Privacy - we don’t want to open the network up to sniffing of traffic
- Safety - Testing and tagging, cable tripping hazards
- More technical - Sometimes what people think are switches are routers and provide a rouge DHCP server