Eleven hours after the cluster rebuild, I tried to use Transmission and it wasn’t there. Not erroring. Not crash-looping. Just… not there. The pod had never been created, and I hadn’t noticed because nothing screamed at me about it.
Two chained time bombs, both invisible until the cluster rebuild forced pod recreation. The first one was a PodSecurity policy that silently blocked Gluetun’s NET_ADMIN capability. The second — stale IPv6 IP rules from the previous pod’s unclean shutdown — was hiding behind it.
| Field | Value |
|---|---|
| Date | 2026-03-08 |
| Duration | 11+ hours (Transmission unreachable) |
| Severity | Single service outage — Transmission/Gluetun VPN stack completely down |
| Trigger | apps namespace PodSecurity baseline enforcement blocking Gluetun’s NET_ADMIN capability |
| Related | Follow-on from cluster outage post-mortem |
## Timeline (Quick Reference)
| Time | Event |
|---|---|
| T+0 (previous incident) | Cluster hard reset after node restart outage |
| T+0 (previous incident) | PodSecurity restricted enforcement added to networking namespace |
| T+0 (this incident) | apps namespace already had baseline enforcement from previous security review |
| T+0 | Transmission deployment created by ArgoCD after cluster rebuild |
| T+0 | ReplicaSet fails to create pod — FailedCreate due to PodSecurity violation |
| T+11h | Issue noticed: “Still can’t use transmission” |
| T+11h+5m | Root cause identified: Gluetun NET_ADMIN blocked by baseline enforcement |
| T+11h+10m | Fix: apps namespace changed from baseline to privileged enforcement |
| T+11h+12m | ArgoCD synced namespace change, rollout restarted |
| T+11h+13m | Pod created, but Gluetun stuck in retry loop — IPv6 rule conflict |
| T+11h+20m | Second fix: postStart lifecycle hook to clear stale IP rules |
| T+11h+25m | Pushed, ArgoCD synced, VPN connected, transmission fully operational |
## Part 1: The Incident — A Second Time Bomb from the Same Root Cause

### Context: The Previous Outage Created This One
Hours earlier, the cluster had been rebuilt after the node restart outage. During that recovery, I found that the networking namespace had restricted PodSecurity enforcement blocking cloudflared and gitlab-agent pods. Fixed those manifests, moved on.
What I didn’t check: whether the other namespace — apps — had a similar time bomb waiting. It did.
During a previous security review, the apps namespace had been given baseline enforcement:
```yaml
metadata:
  name: apps
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
```
When this was applied, every pod in apps was already running. The admission controller never checked them. The baseline standard seemed safe — it’s less strict than restricted, and most applications don’t need host networking or privileged containers.
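In hindsight, there is a way to surface these latent violations without waiting for pod recreation. A server-side dry run of the enforce label makes the admission controller evaluate every existing pod in the namespace and report which ones would be rejected, without changing anything (sketch, assuming a kubectl recent enough to support `--dry-run=server`):

```shell
# Preview violations without enforcing: the server-side dry run returns
# admission warnings for existing pods that would violate the level.
kubectl label --dry-run=server --overwrite namespace apps \
  pod-security.kubernetes.io/enforce=baseline
```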
But one application does need something baseline blocks: Gluetun, the VPN sidecar for Transmission, requires NET_ADMIN capability to create WireGuard tunnels and manage iptables firewall rules.
### The Symptom: 11 Hours of Silence
After the cluster rebuild, ArgoCD synced all applications. Most pods came up fine. Transmission did not. The ReplicaSet entered FailedCreate state and began exponential backoff:
```shell
kubectl describe replicaset -n apps -l app=transmission

Events:
  Type     Reason        Age  From                   Message
  ----     ------        ---  ----                   -------
  Warning  FailedCreate  52m  replicaset-controller  Error creating: pods
  "transmission-64f594958d-ztvlm" is forbidden: violates PodSecurity
  "baseline:latest": non-default capabilities (container "gluetun" must
  not include "NET_ADMIN" in securityContext.capabilities.add)
```
The error is precise: baseline rejects any capability additions beyond the default set. NET_ADMIN isn’t in the default set. The pod was never created.
Because the ReplicaSet was in backoff and the pod never existed, there was nothing visible from kubectl get pods — just “No resources found.” This is the quiet version of a failure: no crash loops, no error pods, no logs. Just absence. The deployment existed, the ReplicaSet existed, but the pod count was zero.
The most dangerous failures are the silent ones. A CrashLoopBackOff screams at you. A FailedCreate with exponential backoff whispers once every 15 minutes and then goes quiet.
### Why This Went Unnoticed for 11 Hours
Three factors conspired to hide this — and honestly, I’m a little embarrassed by it:
The cluster rebuild was noisy. During recovery from the previous outage, there were dozens of pods starting up, PVCs binding, ArgoCD syncing. Transmission’s absence was one signal lost in the noise of a full cluster rebuild. I was focused on getting everything else back up and didn’t check whether “everything” actually meant everything.
No alerting on FailedCreate. Uptime Kuma monitors HTTP endpoints, but Transmission’s web UI was never exposed to Uptime Kuma (it’s behind the VPN sidecar, not directly reachable). There was no monitor that would catch “deployment has 0/1 ready replicas.”
Exponential backoff hides the problem. The ReplicaSet tried to create the pod at T+0, T+1m, T+2m, T+4m, T+8m, T+16m… By T+11h, it was retrying roughly every 15 minutes. The Events section only retains recent events, so by the time I looked, only the most recent handful of failures were visible.
If a deployment’s desired replica count doesn’t match its ready count for more than 10 minutes, something is wrong. That should be a first-class alert, not something I discover when I want to download something.
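The missing check is small. Here is a sketch of the logic (a hypothetical script, not something that exists in the repo): it consumes the JSON shape produced by `kubectl get deployments -A -o json`, with a canned sample standing in for live cluster output, and flags any deployment whose ready count is below its desired count.

```python
import json

# Sketch of the missing alert: flag deployments with ready < desired.
# Input shape matches `kubectl get deployments -A -o json`; the sample
# below (including the hypothetical "jellyfin" app) stands in for it.
sample = json.dumps({"items": [
    {"metadata": {"namespace": "apps", "name": "transmission"},
     "spec": {"replicas": 1}, "status": {}},                    # FailedCreate: no pods
    {"metadata": {"namespace": "apps", "name": "jellyfin"},
     "spec": {"replicas": 1}, "status": {"readyReplicas": 1}},  # healthy
]})

def unhealthy(deployments_json: str) -> list[str]:
    flagged = []
    for d in json.loads(deployments_json)["items"]:
        desired = d["spec"].get("replicas", 1)
        ready = d["status"].get("readyReplicas", 0)  # key absent when 0 pods are ready
        if ready < desired:
            flagged.append(f'{d["metadata"]["namespace"]}/{d["metadata"]["name"]}: '
                           f'{ready}/{desired} ready')
    return flagged

print(unhealthy(sample))  # → ['apps/transmission: 0/1 ready']
```

Note that `readyReplicas` is simply absent from the status when no pods exist, which is exactly the FailedCreate case: the check has to default the missing key to zero rather than skip it.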
## Part 2: PodSecurity Levels and the NET_ADMIN Problem

### What Each PodSecurity Level Actually Blocks
The previous post-mortem introduced PodSecurity Standards and the three levels. This incident reveals a subtlety that post didn’t cover: the gap between baseline and privileged isn’t just about “privileged containers.”
Here’s what each level blocks for capabilities specifically:
| Standard | Capability Rules | What This Means |
|---|---|---|
| privileged | No restrictions | Any capability can be added |
| baseline | Only allows AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL, MKNOD, NET_BIND_SERVICE, SETFCAP, SETGID, SETPCAP, SETUID, SYS_CHROOT | These are the default Docker capabilities. Anything else is rejected. |
| restricted | Must drop ALL, may only add NET_BIND_SERVICE | Almost no capabilities allowed |
NET_ADMIN is not in the baseline allowed list. This capability is needed for:

- Creating and managing network interfaces (WireGuard tunnels)
- Modifying routing tables (`ip route`, `ip rule`)
- Managing iptables/nftables firewall rules
- Setting socket options
For a VPN sidecar like Gluetun, NET_ADMIN is not optional. Without it, the container can’t create the WireGuard tunnel, can’t set up routing to force traffic through the tunnel, and can’t configure the firewall rules that prevent traffic leaks. The VPN sidecar pattern fundamentally requires this capability.
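For reference, this is the shape of the grant that baseline rejects — a sketch of the relevant slice of the container spec, not the full manifest:

```yaml
# Sketch: the capability grant that baseline admission rejects.
containers:
  - name: gluetun
    image: qmcgaw/gluetun   # image name assumed for illustration
    securityContext:
      capabilities:
        add: ["NET_ADMIN"]  # not in baseline's allowed set → FailedCreate
```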
### The Decision: Privileged Enforcement for Apps
The fix was to change the apps namespace from baseline to privileged enforcement:
```yaml
metadata:
  name: apps
  labels:
    # Privileged enforce: Gluetun (VPN sidecar) requires NET_ADMIN capability
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: restricted
```
This keeps `warn: restricted` so API responses still flag non-compliant pods — you’ll see warnings during `kubectl apply` and `kubectl rollout restart` for any pod that doesn’t meet the restricted standard. The enforcement just doesn’t block creation.
Why not move Transmission to its own namespace? I considered this. A dedicated vpn namespace with privileged enforcement would keep apps at baseline. But it adds significant plumbing: new PV/PVC pairs (PVCs can’t cross namespaces), new NetworkPolicies, a new ArgoCD application. For a single-node homelab with one VPN-dependent app, the complexity cost exceeds the security benefit. The CI validation script (validate-podsecurity.sh) enforces allowPrivilegeEscalation: false as a project convention on all apps containers anyway, which catches the most common privilege escalation vectors.
Why not use a PodSecurity exemption? The PodSecurity admission controller supports exemptions for specific users, namespaces, and runtime classes — but not for specific pods or deployments. You can’t say “allow NET_ADMIN for just the transmission deployment.” It’s all-or-nothing at the namespace level.
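For completeness, a sketch of where exemptions live — an API-server `AdmissionConfiguration`, which is exactly why they can’t target a single deployment (field names per the PodSecurity admission docs; treat the values as illustrative):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: PodSecurity
    configuration:
      apiVersion: pod-security.admission.config.k8s.io/v1
      kind: PodSecurityConfiguration
      exemptions:
        usernames: []        # exempt specific users
        runtimeClasses: []   # exempt specific runtime classes
        namespaces: []       # exempt whole namespaces — no per-pod granularity
```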
### The “Will This Break the Container?” Question
When adding security restrictions to containers, the critical question is: does the container’s internal process model require the capability being restricted?
For `allowPrivilegeEscalation: false` specifically:

- GitLab Omnibus: Can’t use it. GitLab runs Chef internally, which uses `su`/`chpst` to switch between service UIDs (git:998, gitlab-www:999, registry, etc.). The `PR_SET_NO_NEW_PRIVS` flag that `allowPrivilegeEscalation: false` sets would block these UID transitions.
- GitLab Runner: Safe to add. The runner is a single Go binary that communicates with the Kubernetes API. It doesn’t switch UIDs or use setuid binaries.
- Gluetun: Has `allowPrivilegeEscalation: false` but needs `NET_ADMIN` capability added explicitly. It doesn’t need privilege escalation — it needs a specific capability granted at container start.
Security contexts aren’t one-size-fits-all. Before adding restrictions, understand the container’s process model. `allowPrivilegeEscalation: false` is safe for single-process Go/Node/Python apps. It breaks multi-service containers that use `su`/`sudo`/`chpst` internally. Capabilities are orthogonal — a container can drop privilege escalation while still having NET_ADMIN.
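That orthogonality fits in a single securityContext (sketch; Gluetun’s actual manifest may differ):

```yaml
# Sketch: escalation and capabilities are independent knobs.
securityContext:
  allowPrivilegeEscalation: false   # sets PR_SET_NO_NEW_PRIVS at exec time
  capabilities:
    drop: ["ALL"]
    add: ["NET_ADMIN"]              # granted at start, not escalated into
```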
## Part 3: Stale IP Rules — The Second Time Bomb

### The First Fix Didn’t Fix It
After changing the apps namespace to privileged enforcement, ArgoCD synced, but the existing ReplicaSet was in exponential backoff and wouldn’t retry immediately. A kubectl rollout restart forced a new ReplicaSet. The pod was created successfully — the PodSecurity gate was gone.
And then:
```shell
kubectl logs deployment/transmission -n apps -c gluetun --tail=10

INFO [wireguard] Using available kernelspace implementation
INFO [wireguard] Connecting to 79.135.104.58:51820
INFO [wireguard] if you are using Kubernetes, this may fix the error below:
https://github.com/qdm12/gluetun-wiki/blob/main/setup/advanced/kubernetes.md#adding-ipv6-rule--file-exists
ERROR [vpn] adding IPv6 rule: adding ip rule 101: from all to all table 51820:
netlink receive: file exists
INFO [vpn] retrying in 15s
```
Of course. A second problem was hiding behind the first one.
The VPN tunnel never established. Gluetun retried every 15s, then 30s, then exponentially, hitting the same error every time. Meanwhile, the transmission container was spewing `sendto: Operation not permitted` — its network traffic was being blocked by Gluetun’s firewall rules (which were set up before the tunnel, as a kill switch to prevent traffic leaks).
### Root Cause: Stale IP Rules from Previous Pod
In Kubernetes, sidecar containers in the same pod share a network namespace. When Gluetun starts, it creates IP routing rules to direct traffic through the WireGuard tunnel:
```shell
ip rule add from all to all table 51820
ip -6 rule add from all to all table 51820
```
When the pod is destroyed cleanly, Gluetun’s shutdown handler removes these rules. But when the pod is abruptly terminated — by a node restart, a hard kill, or a ReplicaSet deleting the pod — the shutdown handler doesn’t run. The rules linger in the node’s network namespace.
The new pod gets a new network namespace — but in k3d, the “node” is a Docker container with a single network namespace that persists across pod recreations. The stale rules from the previous Gluetun instance were still present. When the new Gluetun tried to add the same rules, the kernel returned EEXIST — “file exists.”
Same time bomb pattern as PodSecurity: invisible state that only causes problems when something triggers pod recreation after an unclean shutdown.
```
Normal pod lifecycle:
Start → Create IP rules → VPN runs → Graceful shutdown → Delete IP rules → Clean

Abrupt termination:
Start → Create IP rules → VPN runs → SIGKILL → Rules left behind

Next pod startup:
Start → Create IP rules → EEXIST → "file exists" → Retry loop → Never connects
```
### The Fix: postStart Lifecycle Hook
The Gluetun wiki documents this exact issue and provides the fix: a postStart lifecycle hook that clears stale rules before Gluetun tries to add them:
```yaml
lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "-c", "(ip rule del table 51820; ip -6 rule del table 51820) || true"]
```
The `|| true` ensures the hook doesn’t fail if the rules don’t exist (clean start). If they do exist (dirty restart), they’re cleaned up before Gluetun’s main process runs.
After adding this hook, pushing to git, and letting ArgoCD sync:
```
INFO [wireguard] Connecting to 79.135.104.58:51820
INFO [ip getter] Public IP address is 139.28.218.8 (Canada, Quebec, Montréal)
```
VPN connected. Transmission operational. Public IP confirmed as a ProtonVPN exit node in Montreal. Finally.
### Why This Worked Before (and Stopped Working)
This bug has always existed in the Gluetun configuration. The postStart hook was never present. So why did Transmission work before?
Because the pod had been running continuously since it was first deployed. It was never abruptly terminated. The WireGuard IP rules were created once, on the first pod start, and stayed valid for the lifetime of that pod. No stale rules to conflict with because there was no previous dirty shutdown.
The cluster hard reset created the conditions for this bug to manifest. The node (the k3d container) was destroyed and recreated, which should have wiped any lingering state. But when Gluetun’s pod came up for the first time on the fresh cluster, the IPv6 rule creation still failed: the k3d node’s network namespace already carried a conflicting rule left over from k3s initialization.
“It worked before” is not a safety argument. If your deployment configuration doesn’t handle unclean shutdowns, it’s a time bomb waiting for the next node restart, cluster rebuild, or spot instance termination. Defense-in-depth means handling the unhappy path, even if you’ve never seen it fail.
## Part 4: Action Items and What I Learned

### Lessons Learned
This incident adds two patterns to the playbook, both variations on the time bomb theme from the previous outage.
1. When fixing a class of bug, fix it everywhere — not just where you found it.
The previous outage fixed the networking namespace. I didn’t check apps. The CI validation job (validate-podsecurity.sh) was added as an action item from the previous post-mortem, but it didn’t exist yet when the cluster was rebuilt. If it had, it would’ve caught the NET_ADMIN incompatibility before deployment. If one namespace had a PodSecurity time bomb, every other namespace might too.
2. Sidecar containers with kernel-level operations need shutdown/startup hygiene.
Gluetun modifies IP routing tables, iptables rules, and WireGuard interfaces. These are kernel-level state changes that persist beyond the container’s lifetime. Any container that modifies shared kernel state must handle both clean and dirty startup:
- Clean start: No stale state. Normal initialization.
- Dirty start: Stale state from previous instance. Must be cleaned up before re-initialization.
The postStart lifecycle hook pattern handles this. The general form:
```yaml
lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "-c", "<cleanup stale state> || true"]
```
If running the startup sequence twice causes a conflict, the container is fragile.
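To make that fragility concrete, here is a toy model of the failure mode — the functions are hypothetical stand-ins, not Gluetun’s actual code, and the set stands in for the node’s persistent network namespace:

```python
# Toy model of the EEXIST failure mode. `rules` stands in for the node's
# persistent network namespace; functions are hypothetical, not Gluetun code.
rules: set[int] = set()

def add_rule(table: int) -> None:
    if table in rules:
        # mirrors the kernel returning EEXIST for a duplicate ip rule
        raise FileExistsError(f"ip rule {table}: file exists")
    rules.add(table)

def cleanup(table: int) -> None:
    rules.discard(table)  # delete-if-present, like `ip rule del ... || true`

def start(table: int) -> None:
    cleanup(table)  # postStart-style hygiene: clear stale state first
    add_rule(table)

start(51820)  # clean start: nothing stale, rule added
start(51820)  # simulated dirty restart: succeeds only because cleanup() ran first
```

Without the `cleanup()` call in `start()`, the second call would raise — which is exactly the 15-second retry loop Gluetun fell into.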
3. Silent deployment failures need dedicated alerting.
Eleven hours of downtime because I didn’t notice 0/1 replicas. Uptime Kuma checks HTTP endpoints but doesn’t check Kubernetes deployment health; a simple check for desired != ready replicas would have caught this immediately. That’s embarrassing.
### Action Items

#### P0 — Done (This Incident)
| # | Action | Status |
|---|---|---|
| 1 | Change apps namespace from baseline to privileged enforcement | ✅ Done |
| 2 | Add postStart lifecycle hook to Gluetun for stale IP rule cleanup | ✅ Done |
| 3 | Add CI validate-podsecurity.sh job (from previous post-mortem) | ✅ Done |
| 4 | Exempt GitLab Omnibus from baseline checks (legitimately needs privilege escalation) | ✅ Done |
| 5 | Add allowPrivilegeEscalation: false to GitLab Runner | ✅ Done |
| 6 | Update docs: AGENTS.md, networking.md, k3d-cluster.md | ✅ Done |
#### P1 — Do This Week
| # | Action | Why |
|---|---|---|
| 1 | Add Uptime Kuma monitor for Transmission | clutch.yacksmith.ca should be monitored. Even though it’s behind a VPN sidecar, the CF Tunnel route exists and should return a response. |
| 2 | Add replica count alerting | Either via Uptime Kuma’s Kubernetes monitor type, or a cron job that checks kubectl get deployments -A -o json for availableReplicas < replicas. Catches FailedCreate, stuck rollouts, and any deployment with 0 ready pods. |
| 3 | Audit all sidecar containers for shutdown hygiene | Check every initContainer with restartPolicy: Always (sidecar pattern) for kernel state modifications that could leave stale artifacts. Add postStart cleanup hooks where needed. |
#### P2 — Do When Convenient
| # | Action | Why |
|---|---|---|
| 4 | Add PodSecurity capability checks to validate-podsecurity.sh | The current script checks allowPrivilegeEscalation and restricted fields. It should also verify that any container adding capabilities (like NET_ADMIN) is in a namespace that allows them. Catches the exact scenario where a manifest works in one namespace but would fail if moved to another. |
| 5 | Document the VPN sidecar pattern in knowledge/ | Gluetun + Transmission is a non-trivial deployment pattern: shared network namespace, NET_ADMIN requirement, kill switch, stale IP rule handling. Document it for future reference and any new VPN-dependent apps. |
### The Chained Failure Pattern
This incident is a textbook example of chained failures — multiple independent problems that each contribute to a longer outage:
```
Chain 1: PodSecurity baseline blocks NET_ADMIN
  → Pod never created
  → 11 hours of silent downtime
  → Fix: change namespace to privileged

Chain 2: Stale IPv6 IP rules from dirty shutdown
  → VPN tunnel can't establish
  → Transmission has no network
  → Fix: postStart lifecycle hook

Chain 3: No monitoring for deployment health
  → Nobody noticed 0/1 replicas for 11 hours
  → Fix: add replica count alerting (pending)
```
Each chain had to be broken independently. Fixing the PodSecurity policy revealed the IP rule problem. Neither would’ve been caught by the existing monitoring.
After fixing one problem, verify the system actually works end-to-end. Don’t assume that removing one blocker means everything else is fine. Check logs, check connectivity, check the actual user-facing behavior. I learned that twice in one day.
Second post-mortem from the same day, caused by the same root pattern: invisible state that only manifests during pod recreation. The previous outage was PodSecurity in networking; this one was PodSecurity in apps plus stale kernel state from Gluetun. Same lesson: test your manifests against admission policies proactively, and design containers to handle dirty startup. I’m getting the message.