Eleven hours after the cluster rebuild, I tried to use Transmission and it wasn’t there. Not erroring. Not crash-looping. Just… not there. The pod had never been created, and I hadn’t noticed because nothing screamed at me about it.
Two chained time bombs, both invisible until the cluster rebuild forced pod recreation. The first one was a PodSecurity policy that silently blocked Gluetun’s NET_ADMIN capability. The second — stale IPv6 IP rules from the previous pod’s unclean shutdown — was hiding behind it.
| Field | Value |
|---|---|
| Date | 2026-03-08 |
| Duration | 11+ hours (Transmission unreachable) |
| Severity | Single service outage — Transmission/Gluetun VPN stack completely down |
| Trigger | apps namespace PodSecurity baseline enforcement blocking Gluetun’s NET_ADMIN capability |
| Related | Follow-on from cluster outage post-mortem |
## Timeline (Quick Reference)
| Time | Event |
|---|---|
| T+0 (previous incident) | Cluster hard reset after node restart outage |
| T+0 (previous incident) | PodSecurity restricted enforcement added to networking namespace |
| T+0 (this incident) | apps namespace already had baseline enforcement from previous security review |
| T+0 | Transmission deployment created by ArgoCD after cluster rebuild |
| T+0 | ReplicaSet fails to create pod — FailedCreate due to PodSecurity violation |
| T+11h | Issue noticed: “Still can’t use transmission” |
| T+11h+5m | Root cause identified: Gluetun NET_ADMIN blocked by baseline enforcement |
| T+11h+10m | Fix: apps namespace changed from baseline to privileged enforcement |
| T+11h+12m | ArgoCD synced namespace change, rollout restarted |
| T+11h+13m | Pod created, but Gluetun stuck in retry loop — IPv6 rule conflict |
| T+11h+20m | Second fix: postStart lifecycle hook to clear stale IP rules |
| T+11h+25m | Pushed, ArgoCD synced, VPN connected, transmission fully operational |
## Part 1: The Incident — A Second Time Bomb from the Same Root Cause

### Context: The Previous Outage Created This One
Hours earlier, the cluster had been rebuilt after the node restart outage. During that recovery, I found that the networking namespace had restricted PodSecurity enforcement blocking cloudflared and gitlab-agent pods. Fixed those manifests, moved on.
What I didn’t check: whether the other namespace — apps — had a similar time bomb waiting. It did.
During a previous security review, the apps namespace had been given baseline enforcement:
```yaml
metadata:
  name: apps
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
```
When this was applied, every pod in apps was already running. The admission controller never checked them. The baseline standard seemed safe — it’s less strict than restricted, and most applications don’t need host networking or privileged containers.
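In hindsight, there is a way to surface these latent violations without waiting for pod recreation. A server-side dry run of the enforce label makes the admission controller evaluate every existing pod in the namespace and report which ones would be rejected, without changing anything (sketch, assuming a kubectl recent enough to support `--dry-run=server`):

```shell
# Preview violations without enforcing: the server-side dry run returns
# admission warnings for existing pods that would violate the level.
kubectl label --dry-run=server --overwrite namespace apps \
  pod-security.kubernetes.io/enforce=baseline
```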
But one application does need something baseline blocks: Gluetun, the VPN sidecar for Transmission, requires NET_ADMIN capability to create WireGuard tunnels and manage iptables firewall rules.
### The Symptom: 11 Hours of Silence
After the cluster rebuild, ArgoCD synced all applications. Most pods came up fine. Transmission did not. The ReplicaSet entered FailedCreate state and began exponential backoff:
```shell
kubectl describe replicaset -n apps -l app=transmission

Events:
  Type     Reason        Age  From                   Message
  ----     ------        ---  ----                   -------
  Warning  FailedCreate  52m  replicaset-controller  Error creating: pods
  "transmission-64f594958d-ztvlm" is forbidden: violates PodSecurity
  "baseline:latest": non-default capabilities (container "gluetun" must
  not include "NET_ADMIN" in securityContext.capabilities.add)
```
The error is precise: baseline rejects any capability additions beyond the default set. NET_ADMIN isn’t in the default set. The pod was never created.
Because the ReplicaSet was in backoff and the pod never existed, there was nothing visible from kubectl get pods — just “No resources found.” This is the quiet version of a failure: no crash loops, no error pods, no logs. Just absence. The deployment existed, the ReplicaSet existed, but the pod count was zero.
The most dangerous failures are the silent ones. A CrashLoopBackOff screams at you. A FailedCreate with exponential backoff whispers once every 15 minutes and then goes quiet.
### Why This Went Unnoticed for 11 Hours
Three factors conspired to hide this — and honestly, I’m a little embarrassed by it:
The cluster rebuild was noisy. During recovery from the previous outage, there were dozens of pods starting up, PVCs binding, ArgoCD syncing. Transmission’s absence was one signal lost in the noise of a full cluster rebuild. I was focused on getting everything else back up and didn’t check whether “everything” actually meant everything.
No alerting on FailedCreate. Uptime Kuma monitors HTTP endpoints, but Transmission’s web UI was never exposed to Uptime Kuma (it’s behind the VPN sidecar, not directly reachable). There was no monitor that would catch “deployment has 0/1 ready replicas.”
Exponential backoff hides the problem. The ReplicaSet tried to create the pod at T+0, T+1m, T+2m, T+4m, T+8m, T+16m… By T+11h, it was retrying roughly every 15 minutes. The Events section only retains recent events, so by the time I looked, only the most recent handful of failures were visible.
If a deployment’s desired replica count doesn’t match its ready count for more than 10 minutes, something is wrong. That should be a first-class alert, not something I discover when I want to download something.
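The missing check is small. Here is a sketch of the logic (a hypothetical script, not something that exists in the repo): it consumes the JSON shape produced by `kubectl get deployments -A -o json`, with a canned sample standing in for live cluster output, and flags any deployment whose ready count is below its desired count.

```python
import json

# Sketch of the missing alert: flag deployments with ready < desired.
# Input shape matches `kubectl get deployments -A -o json`; the sample
# below (including the hypothetical "jellyfin" app) stands in for it.
sample = json.dumps({"items": [
    {"metadata": {"namespace": "apps", "name": "transmission"},
     "spec": {"replicas": 1}, "status": {}},                    # FailedCreate: no pods
    {"metadata": {"namespace": "apps", "name": "jellyfin"},
     "spec": {"replicas": 1}, "status": {"readyReplicas": 1}},  # healthy
]})

def unhealthy(deployments_json: str) -> list[str]:
    flagged = []
    for d in json.loads(deployments_json)["items"]:
        desired = d["spec"].get("replicas", 1)
        ready = d["status"].get("readyReplicas", 0)  # key absent when 0 pods are ready
        if ready < desired:
            flagged.append(f'{d["metadata"]["namespace"]}/{d["metadata"]["name"]}: '
                           f'{ready}/{desired} ready')
    return flagged

print(unhealthy(sample))  # → ['apps/transmission: 0/1 ready']
```

Note that `readyReplicas` is simply absent from the status when no pods exist, which is exactly the FailedCreate case: the check has to default the missing key to zero rather than skip it.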
## Part 2: PodSecurity Levels and the NET_ADMIN Problem

### What Each PodSecurity Level Actually Blocks
The previous post-mortem introduced PodSecurity Standards and the three levels. This incident reveals a subtlety that post didn’t cover: the gap between baseline and privileged isn’t just about “privileged containers.”
Here’s what each level blocks for capabilities specifically:
| Standard | Capability Rules | What This Means |
|---|---|---|
| privileged | No restrictions | Any capability can be added |
| baseline | Only allows AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL, MKNOD, NET_BIND_SERVICE, SETFCAP, SETGID, SETPCAP, SETUID, SYS_CHROOT | These are the default Docker capabilities. Anything else is rejected. |
| restricted | Must drop ALL, may only add NET_BIND_SERVICE | Almost no capabilities allowed |
NET_ADMIN is not in the baseline allowed list. This capability is needed for:

- Creating and managing network interfaces (WireGuard tunnels)
- Modifying routing tables (`ip route`, `ip rule`)
- Managing iptables/nftables firewall rules
- Setting socket options
For a VPN sidecar like Gluetun, NET_ADMIN is not optional. Without it, the container can’t create the WireGuard tunnel, can’t set up routing to force traffic through the tunnel, and can’t configure the firewall rules that prevent traffic leaks. The VPN sidecar pattern fundamentally requires this capability.
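For reference, this is the shape of the grant that baseline rejects — a sketch of the relevant slice of the container spec, not the full manifest:

```yaml
# Sketch: the capability grant that baseline admission rejects.
containers:
  - name: gluetun
    image: qmcgaw/gluetun   # image name assumed for illustration
    securityContext:
      capabilities:
        add: ["NET_ADMIN"]  # not in baseline's allowed set → FailedCreate
```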
### The Decision: Privileged Enforcement for Apps
The fix was to change the apps namespace from baseline to privileged enforcement:
```yaml
metadata:
  name: apps
  labels:
    # Privileged enforce: Gluetun (VPN sidecar) requires NET_ADMIN capability
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: restricted
```
This keeps `warn: restricted` so API responses still flag non-compliant pods — you’ll see warnings during `kubectl apply` and `kubectl rollout restart` for any pod that doesn’t meet the restricted standard. The enforcement just doesn’t block creation.
Why not move Transmission to its own namespace? I considered this. A dedicated vpn namespace with privileged enforcement would keep apps at baseline. But it adds significant plumbing: new PV/PVC pairs (PVCs can’t cross namespaces), new NetworkPolicies, a new ArgoCD application. For a single-node homelab with one VPN-dependent app, the complexity cost exceeds the security benefit. The CI validation script (validate-podsecurity.sh) enforces allowPrivilegeEscalation: false as a project convention on all apps containers anyway, which catches the most common privilege escalation vectors.
Why not use a PodSecurity exemption? The PodSecurity admission controller supports exemptions for specific users, namespaces, and runtime classes — but not for specific pods or deployments. You can’t say “allow NET_ADMIN for just the transmission deployment.” It’s all-or-nothing at the namespace level.
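For completeness, a sketch of where exemptions live — an API-server `AdmissionConfiguration`, which is exactly why they can’t target a single deployment (field names per the PodSecurity admission docs; treat the values as illustrative):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: PodSecurity
    configuration:
      apiVersion: pod-security.admission.config.k8s.io/v1
      kind: PodSecurityConfiguration
      exemptions:
        usernames: []        # exempt specific users
        runtimeClasses: []   # exempt specific runtime classes
        namespaces: []       # exempt whole namespaces — no per-pod granularity
```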
### The “Will This Break the Container?” Question
When adding security restrictions to containers, the critical question is: does the container’s internal process model require the capability being restricted?
For `allowPrivilegeEscalation: false` specifically:

- GitLab Omnibus: Can’t use it. GitLab runs Chef internally, which uses `su`/`chpst` to switch between service UIDs (git:998, gitlab-www:999, registry, etc.). The `PR_SET_NO_NEW_PRIVS` flag that `allowPrivilegeEscalation: false` sets would block these UID transitions.
- GitLab Runner: Safe to add. The runner is a single Go binary that communicates with the Kubernetes API. It doesn’t switch UIDs or use setuid binaries.
- Gluetun: Has `allowPrivilegeEscalation: false` but needs `NET_ADMIN` capability added explicitly. It doesn’t need privilege escalation — it needs a specific capability granted at container start.
Security contexts aren’t one-size-fits-all. Before adding restrictions, understand the container’s process model. `allowPrivilegeEscalation: false` is safe for single-process Go/Node/Python apps. It breaks multi-service containers that use `su`/`sudo`/`chpst` internally. Capabilities are orthogonal — a container can drop privilege escalation while still having NET_ADMIN.
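That orthogonality fits in a single securityContext (sketch; Gluetun’s actual manifest may differ):

```yaml
# Sketch: escalation and capabilities are independent knobs.
securityContext:
  allowPrivilegeEscalation: false   # sets PR_SET_NO_NEW_PRIVS at exec time
  capabilities:
    drop: ["ALL"]
    add: ["NET_ADMIN"]              # granted at start, not escalated into
```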
## Part 3: Stale IP Rules — The Second Time Bomb

### The First Fix Didn’t Fix It
After changing the apps namespace to privileged enforcement, ArgoCD synced, but the existing ReplicaSet was in exponential backoff and wouldn’t retry immediately. A kubectl rollout restart forced a new ReplicaSet. The pod was created successfully — the PodSecurity gate was gone.
And then:
```shell
kubectl logs deployment/transmission -n apps -c gluetun --tail=10

INFO [wireguard] Using available kernelspace implementation
INFO [wireguard] Connecting to 79.135.104.58:51820
INFO [wireguard] if you are using Kubernetes, this may fix the error below:
https://github.com/qdm12/gluetun-wiki/blob/main/setup/advanced/kubernetes.md#adding-ipv6-rule--file-exists
ERROR [vpn] adding IPv6 rule: adding ip rule 101: from all to all table 51820:
netlink receive: file exists
INFO [vpn] retrying in 15s
```
Of course. A second problem was hiding behind the first one.
The VPN tunnel never established. Gluetun retried every 15s, then 30s, then exponentially, hitting the same error every time. Meanwhile, the transmission container was spewing `sendto: Operation not permitted` — its network traffic was being blocked by Gluetun’s firewall rules (which were set up before the tunnel, as a kill switch to prevent traffic leaks).
### Root Cause: Stale IP Rules from Previous Pod
In Kubernetes, sidecar containers in the same pod share a network namespace. When Gluetun starts, it creates IP routing rules to direct traffic through the WireGuard tunnel:
```shell
ip rule add from all to all table 51820
ip -6 rule add from all to all table 51820
```
When the pod is destroyed cleanly, Gluetun’s shutdown handler removes these rules. But when the pod is abruptly terminated — by a node restart, a hard kill, or a ReplicaSet deleting the pod — the shutdown handler doesn’t run. The rules linger in the node’s network namespace.
The new pod gets a new network namespace — but in k3d, the “node” is a Docker container with a single network namespace that persists across pod recreations. The stale rules from the previous Gluetun instance were still present. When the new Gluetun tried to add the same rules, the kernel returned EEXIST — “file exists.”
Same time bomb pattern as PodSecurity: invisible state that only causes problems when something triggers pod recreation after an unclean shutdown.
```
Normal pod lifecycle:
Start → Create IP rules → VPN runs → Graceful shutdown → Delete IP rules → Clean

Abrupt termination:
Start → Create IP rules → VPN runs → SIGKILL → Rules left behind

Next pod startup:
Start → Create IP rules → EEXIST → "file exists" → Retry loop → Never connects
```
### The Fix: postStart Lifecycle Hook
The Gluetun wiki documents this exact issue and provides the fix: a postStart lifecycle hook that clears stale rules before Gluetun tries to add them:
```yaml
lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "-c", "(ip rule del table 51820; ip -6 rule del table 51820) || true"]
```
The `|| true` ensures the hook doesn’t fail if the rules don’t exist (clean start). If they do exist (dirty restart), they’re cleaned up before Gluetun’s main process runs.
After adding this hook, pushing to git, and letting ArgoCD sync:
```
INFO [wireguard] Connecting to 79.135.104.58:51820
INFO [ip getter] Public IP address is 139.28.218.8 (Canada, Quebec, Montréal)
```
VPN connected. Transmission operational. Public IP confirmed as a ProtonVPN exit node in Montreal. Finally.
### Why This Worked Before (and Stopped Working)
This bug has always existed in the Gluetun configuration. The postStart hook was never present. So why did Transmission work before?
Because the pod had been running continuously since it was first deployed. It was never abruptly terminated. The WireGuard IP rules were created once, on the first pod start, and stayed valid for the lifetime of that pod. No stale rules to conflict with because there was no previous dirty shutdown.
The cluster hard reset created the conditions for this bug to manifest. The node (the k3d container) was destroyed and recreated, which should have wiped any lingering state. But when Gluetun’s pod came up for the first time on the fresh cluster, the IPv6 rule creation still failed: the k3d node’s network namespace already carried a conflicting rule left over from k3s initialization.
“It worked before” is not a safety argument. If your deployment configuration doesn’t handle unclean shutdowns, it’s a time bomb waiting for the next node restart, cluster rebuild, or spot instance termination. Defense-in-depth means handling the unhappy path, even if you’ve never seen it fail.
## Part 4: Action Items and What I Learned

### Lessons Learned
This incident adds two patterns to the playbook, both variations on the time bomb theme from the previous outage.
1. When fixing a class of bug, fix it everywhere — not just where you found it.
The previous outage fixed the networking namespace. I didn’t check apps. The CI validation job (validate-podsecurity.sh) was added as an action item from the previous post-mortem, but it didn’t exist yet when the cluster was rebuilt. If it had, it would’ve caught the NET_ADMIN incompatibility before deployment. If one namespace had a PodSecurity time bomb, every other namespace might too.
2. Sidecar containers with kernel-level operations need shutdown/startup hygiene.
Gluetun modifies IP routing tables, iptables rules, and WireGuard interfaces. These are kernel-level state changes that persist beyond the container’s lifetime. Any container that modifies shared kernel state must handle both clean and dirty startup:
- Clean start: No stale state. Normal initialization.
- Dirty start: Stale state from previous instance. Must be cleaned up before re-initialization.
The postStart lifecycle hook pattern handles this. The general form:
```yaml
lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "-c", "<cleanup stale state> || true"]
```
If running the startup sequence twice causes a conflict, the container is fragile.
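To make that fragility concrete, here is a toy model of the failure mode — the functions are hypothetical stand-ins, not Gluetun’s actual code, and the set stands in for the node’s persistent network namespace:

```python
# Toy model of the EEXIST failure mode. `rules` stands in for the node's
# persistent network namespace; functions are hypothetical, not Gluetun code.
rules: set[int] = set()

def add_rule(table: int) -> None:
    if table in rules:
        # mirrors the kernel returning EEXIST for a duplicate ip rule
        raise FileExistsError(f"ip rule {table}: file exists")
    rules.add(table)

def cleanup(table: int) -> None:
    rules.discard(table)  # delete-if-present, like `ip rule del ... || true`

def start(table: int) -> None:
    cleanup(table)  # postStart-style hygiene: clear stale state first
    add_rule(table)

start(51820)  # clean start: nothing stale, rule added
start(51820)  # simulated dirty restart: succeeds only because cleanup() ran first
```

Without the `cleanup()` call in `start()`, the second call would raise — which is exactly the 15-second retry loop Gluetun fell into.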
3. Silent deployment failures need dedicated alerting.
Eleven hours of downtime because I didn’t notice 0/1 replicas. Uptime Kuma checks HTTP endpoints but doesn’t check Kubernetes deployment health; a simple check for desired != ready replicas would have caught this immediately. That’s embarrassing.
### Action Items

#### P0 — Done (This Incident)
| # | Action | Status |
|---|---|---|
| 1 | Change apps namespace from baseline to privileged enforcement | ✅ Done |
| 2 | Add postStart lifecycle hook to Gluetun for stale IP rule cleanup | ✅ Done |
| 3 | Add CI validate-podsecurity.sh job (from previous post-mortem) | ✅ Done |
| 4 | Exempt GitLab Omnibus from baseline checks (legitimately needs privilege escalation) | ✅ Done |
| 5 | Add allowPrivilegeEscalation: false to GitLab Runner | ✅ Done |
| 6 | Update docs: AGENTS.md, networking.md, k3d-cluster.md | ✅ Done |
#### P1 — Do This Week
| # | Action | Why |
|---|---|---|
| 1 | Add Uptime Kuma monitor for Transmission | clutch.yacksmith.ca should be monitored. Even though it’s behind a VPN sidecar, the CF Tunnel route exists and should return a response. |
| 2 | Add replica count alerting | Either via Uptime Kuma’s Kubernetes monitor type, or a cron job that checks kubectl get deployments -A -o json for availableReplicas < replicas. Catches FailedCreate, stuck rollouts, and any deployment with 0 ready pods. |
| 3 | Audit all sidecar containers for shutdown hygiene | Check every initContainer with restartPolicy: Always (sidecar pattern) for kernel state modifications that could leave stale artifacts. Add postStart cleanup hooks where needed. |
#### P2 — Do When Convenient
| # | Action | Why |
|---|---|---|
| 4 | Add PodSecurity capability checks to validate-podsecurity.sh | The current script checks allowPrivilegeEscalation and restricted fields. It should also verify that any container adding capabilities (like NET_ADMIN) is in a namespace that allows them. Catches the exact scenario where a manifest works in one namespace but would fail if moved to another. |
| 5 | Document the VPN sidecar pattern in knowledge/ | Gluetun + Transmission is a non-trivial deployment pattern: shared network namespace, NET_ADMIN requirement, kill switch, stale IP rule handling. Document it for future reference and any new VPN-dependent apps. |
### The Chained Failure Pattern
This incident is a textbook example of chained failures — multiple independent problems that each contribute to a longer outage:
```
Chain 1: PodSecurity baseline blocks NET_ADMIN
  → Pod never created
  → 11 hours of silent downtime
  → Fix: change namespace to privileged

Chain 2: Stale IPv6 IP rules from dirty shutdown
  → VPN tunnel can't establish
  → Transmission has no network
  → Fix: postStart lifecycle hook

Chain 3: No monitoring for deployment health
  → Nobody noticed 0/1 replicas for 11 hours
  → Fix: add replica count alerting (pending)
```
Each chain had to be broken independently. Fixing the PodSecurity policy revealed the IP rule problem. Neither would’ve been caught by the existing monitoring.
After fixing one problem, verify the system actually works end-to-end. Don’t assume that removing one blocker means everything else is fine. Check logs, check connectivity, check the actual user-facing behavior. I learned that twice in one day.
Second post-mortem from the same day, caused by the same root pattern: invisible state that only manifests during pod recreation. The previous outage was PodSecurity in networking; this one was PodSecurity in apps plus stale kernel state from Gluetun. Same lesson: test your manifests against admission policies proactively, and design containers to handle dirty startup. I’m getting the message.