I ran docker restart k3d-homelab-server-0 and my SSH session froze. Then it disconnected. Then I realized the SSH tunnel runs inside the cluster I just restarted.
That was the beginning of a 75-minute full outage that taught me more about my own infrastructure than the previous six months of it working fine.
| | |
|---|---|
| Date | 2026-03-08 |
| Duration | ~75 minutes |
| Severity | Full outage — all services down, no remote access |
| Trigger | docker restart k3d-homelab-server-0 to pick up containerd registry config |
Each part covers a different failure mode and the debugging methodology behind it. The specific technologies will change; the patterns won’t.
Timeline (Quick Reference)
| Time | Event |
|---|---|
| T+0 | Ran docker restart k3d-homelab-server-0 to reload registries.yaml |
| T+0 | SSH connection drops — CF tunnel runs inside the cluster |
| T+5m | Repeated SSH attempts fail with websocket: bad handshake |
| T+10m | Confirmed from server: cluster is up, but tunnel pod is down |
| T+15m | Attempted basic pod recovery — cluster too broken |
| T+20m | Ran hard-reset.sh — full cluster destroy + recreate |
| T+25m | Hard reset completes, but sealed secrets fail (namespaces don’t exist yet) |
| T+30m | Skipped manual re-seal step, applied apps via ArgoCD |
| T+35m | Cloudflared pod stuck in FailedCreate — PodSecurity violation |
| T+40m | Identified missing securityContext fields in cloudflared + gitlab-agent manifests |
| T+45m | Pushed fix to gitlab.com, ArgoCD syncs, tunnel comes back up |
| T+50m | Discovered 20+ pods stuck Pending — PVCs not binding to PVs |
| T+55m | Found PVs missing environment:prod labels that PVC selectors require |
| T+60m | Labeled PVs, all pods start scheduling |
| T+75m | Full recovery — all services operational |
Part 1: The Incident — What Happened and How We Knew
The Setup
All I wanted to do was add an insecure container registry (k3d-registry.localhost:5050) to the k3d cluster’s containerd config. Edit /etc/rancher/k3s/registries.yaml inside the k3d node container, restart k3s so containerd picks up the change. Simple.
The seemingly obvious way to do this:
docker restart k3d-homelab-server-0
Here is what actually happened.
T+0: The Command
That single command killed every running process inside the k3d container. In a k3d cluster, the Docker container is the node. Restarting it is equivalent to pulling the power cord on a bare-metal server. Every pod dies simultaneously — no graceful shutdown, no drain, no eviction. The kubelet, the API server, etcd, CoreDNS, every application pod. All gone at once.
T+0: The Moment I Knew
The SSH session froze. Then it disconnected. And that’s when it hit me — my SSH connection doesn’t go through a normal network path. It goes through a Cloudflare Tunnel, implemented by a cloudflared pod running in the networking namespace. Inside the very cluster I just restarted.
┌──────────────────────────────────┐
laptop ──SSH──► CF Edge ──tunnel──► cloudflared pod ──► sshd
│ (networking namespace) │
│ INSIDE k3d cluster │
└──────────────────────────────────┘
The access path to debug the cluster runs through the cluster. When the cluster’s down, so is my ability to fix it. I can’t SSH in to fix the thing that lets me SSH in.
This is the blast radius problem — and I’d just learned it the hard way. The command’s effect extended beyond the cluster workloads to include my ability to observe and fix the cluster.
T+5m: Retries and Confirmation
Repeated ssh ms attempts returned websocket: bad handshake. That error comes from the Cloudflare edge — the tunnel endpoint is unreachable, so the WebSocket upgrade that carries the SSH session can’t complete. Each retry confirmed the same thing: the tunnel was down, and it wasn’t coming back on its own.
T+10m: Physical Access
Since remote access was gone, recovery meant walking over to the server. From the local terminal:
kubectl get pods -n networking
# cloudflared-tunnel-xxxxx 0/1 CrashLoopBackOff 3 4m
The cluster had come back up (the k3d container restart did restart k3s), but the cloudflared pod was crash-looping. The node restart had corrupted enough cluster state that pods weren’t recovering cleanly.
T+15m through T+20m: Triage and Decision
Quick triage showed multiple problems: pods in CrashLoopBackOff, PVCs in Pending, secrets missing. The cluster was in a half-alive state that would take longer to untangle than to rebuild. Decision: run hard-reset.sh and start fresh.
This is one of those judgment calls that gets easier with experience — debugging a half-broken cluster can take hours. A full rebuild from a known-good script takes 20 minutes. When the blast radius is “everything,” the fastest path to recovery is often a clean rebuild, not surgical repair.
The Blast Radius Concept
Blast radius is the total set of things that break when something goes wrong. Most operators think about the direct effect (“this restarts the node”) but miss the transitive effects (“the node hosts the tunnel that provides my access to the node”). I certainly did.
To map blast radius, ask three questions:
- What runs on this thing? For k3d-homelab-server-0: everything. Every pod, every service, the entire control plane.
- What depends on those things? Every application, every ingress route, DNS resolution, the tunnel, monitoring, backups — all of it.
- Does my access path depend on any of those things? Yes. SSH goes through the Cloudflare tunnel pod. If the tunnel dies, I’m locked out remotely.
If the answer to question 3 is “yes,” you either (a) have an alternate access path ready, (b) are physically present at the machine, or (c) don’t run the command. I should’ve asked myself that before pressing Enter.
The Pre-Flight Checklist
Every infrastructure change should pass this checklist before execution. It takes 60 seconds and prevents hours-long outages:
- Blast radius: What’s the worst case if this goes wrong? Write it down.
- Rollback plan: Can I undo this? How long will it take? What state will I be in?
- Alternate access: If this breaks my primary access path, how do I get in?
- Non-destructive test: Can I validate this change without applying it? (`--dry-run=server`, `docker exec` to inspect config, etc.)
- Minimal change: Is there a less disruptive way to achieve the same goal? (For registry config: recreate the cluster with `--registry-config` at creation time instead of restarting a running node.)
The blast radius of a command includes your ability to observe its effects. If it can take down your monitoring, your access path, or your ability to roll back — you’re flying blind the moment you press Enter.
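The "alternate access" check is the one I skipped. A minimal sketch of how it could be scripted — the function name and environment variable are hypothetical, not part of this cluster's tooling:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight guard: refuse a destructive command until the
# operator explicitly confirms an alternate access path exists.
confirm_blast_radius() {
  local cmd="$1"
  if [[ "${ALT_ACCESS_CONFIRMED:-no}" != "yes" ]]; then
    echo "REFUSED: '$cmd' — set ALT_ACCESS_CONFIRMED=yes after verifying a backup access path" >&2
    return 1
  fi
  echo "OK: running '$cmd'"
  # a real wrapper would execute the command here
}
```

Wrapping `docker restart k3d-homelab-server-0` in a guard like this forces the 60-second blast-radius conversation before the outage instead of after.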
What the Right Approach Would Have Been
The safe way to add a container registry to a k3d cluster is to pass it at cluster creation time:
k3d cluster create homelab \
--registry-use k3d-registry.localhost:5050 \
--registry-config registries.yaml \
...
If you need to change the registry config on a running cluster, the correct approach is:
- Verify you have physical access or a backup access path
- Inspect current state: `docker exec -it k3d-homelab-server-0 cat /etc/rancher/k3s/registries.yaml`
- Make the change
- Understand that restarting the node will cause a full outage
- Plan for the outage window accordingly
Or better yet: destroy and recreate the cluster with the correct config. In a single-node k3d homelab, this is actually less risky than a node restart because hard-reset.sh follows a tested, ordered sequence rather than hoping everything comes back cleanly after a cold restart.
Part 2: Reading Error Messages Like an SRE — PodSecurity Violations
The Problem at T+35m
After hard-reset.sh recreated the cluster and ArgoCD started syncing applications, the cloudflared tunnel pod was stuck. Instead of Running, it showed FailedCreate:
kubectl get pods -n networking
# NAME READY STATUS RESTARTS AGE
# cloudflared-tunnel-7f8b4d6c9-x2k4j 0/1 FailedCreate 0 2m
Most people would start googling “FailedCreate kubernetes” at this point. That’s backwards. The error message itself tells you exactly what happened — if you know how to read it.
The kubectl Debugging Ladder
Here’s something that took me too long to internalize: information about a failure lives one level up from where you see the symptom. The pod shows FailedCreate, but the pod doesn’t know why it failed — the thing that tried to create it does.
The ownership chain in Kubernetes:
Deployment → ReplicaSet → Pod
When a pod fails to create, the ReplicaSet is the object that attempted the creation and received the error. So the debugging ladder is:
# Step 1: See the symptom (pod level)
kubectl get pods -n networking
# Step 2: Look one level up (replicaset level) — THIS is where the error lives
kubectl describe replicaset -n networking -l app=cloudflared-tunnel
# Step 3: If needed, look two levels up (deployment level)
kubectl describe deployment cloudflared-tunnel -n networking
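The ladder can also be walked mechanically by reading the pod's ownerReferences. A sketch using jq on a trimmed, hypothetical pod object — on a live cluster the JSON would come from `kubectl get pod <name> -o json`:

```shell
# Hypothetical, trimmed pod object — live equivalent:
#   pod_json=$(kubectl get pod cloudflared-tunnel-7f8b4d6c9-x2k4j -n networking -o json)
pod_json='{"metadata":{"name":"cloudflared-tunnel-7f8b4d6c9-x2k4j",
  "ownerReferences":[{"kind":"ReplicaSet","name":"cloudflared-tunnel-7f8b4d6c9"}]}}'

# Climb one level up: the object that attempted (and failed) the creation
owner_kind=$(echo "$pod_json" | jq -r '.metadata.ownerReferences[0].kind')
owner_name=$(echo "$pod_json" | jq -r '.metadata.ownerReferences[0].name')
echo "next rung: kubectl describe $owner_kind $owner_name -n networking"
```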
Running kubectl describe on the ReplicaSet revealed the actual error in its Events section:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 2m replicaset-controller Error creating: pods "cloudflared-tunnel-7f8b4d6c9-x2k4j"
is forbidden: violates PodSecurity "restricted:latest":
allowPrivilegeEscalation != false
(container "cloudflared" must set securityContext.allowPrivilegeEscalation=false),
unrestricted capabilities
(container "cloudflared" must set securityContext.capabilities.drop=["ALL"]),
runAsNonRoot != true
(pod or container "cloudflared" must set securityContext.runAsNonRoot=true),
seccompProfile
(pod or container "cloudflared" must set securityContext.seccompProfile.type
to "RuntimeDefault" or "Localhost")
Kubernetes errors are precise. They tell you exactly what’s wrong and exactly what to fix. Don’t scan for keywords — parse every clause.
Parsing the Error Message
Breaking it down piece by piece:
- `is forbidden` — The API server rejected the pod creation. This isn’t a runtime failure; the pod was never created.
- `violates PodSecurity "restricted:latest"` — The networking namespace has a PodSecurity Standard set to restricted at the enforce level. The pod spec doesn’t meet this standard.
- `allowPrivilegeEscalation != false` — The container must explicitly set `securityContext.allowPrivilegeEscalation: false`.
- `unrestricted capabilities` — The container must drop all Linux capabilities with `capabilities.drop: ["ALL"]`.
- `runAsNonRoot != true` — The pod or container must set `runAsNonRoot: true`.
- `seccompProfile` — The pod or container must set a seccomp profile of type `RuntimeDefault` or `Localhost`.
The error message is literally a checklist of what to add to the manifest. Each line is a missing field. Kubernetes is being helpful here — you just have to read it.
PodSecurity Standards 101
Kubernetes has three PodSecurity Standards, from most to least permissive:
| Standard | What it allows | Use case |
|---|---|---|
| privileged | Everything. No restrictions. | System-level infrastructure (CNI, storage drivers) |
| baseline | Blocks known privilege escalations. Allows most workloads. | General applications |
| restricted | Hardened. Requires explicit security settings. | Security-sensitive namespaces |
Each standard can be applied at three enforcement levels:
| Level | Behavior |
|---|---|
| enforce | Reject pods that violate the standard. Pod is never created. |
| warn | Allow creation, but add a warning to the API response. |
| audit | Allow creation silently, but log the violation. |
The networking namespace in this cluster has:
metadata:
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/warn: restricted
This means: reject any pod that doesn’t meet the restricted standard. No exceptions, no grace period, no warnings-only mode. If your securityContext isn’t perfect, the pod doesn’t get created.
The Time Bomb Pattern
Here’s the insidious part: PodSecurity admission checks only happen at pod creation time, not on running pods.
Before the hard reset, the cloudflared pod was running fine. It had been deployed months ago, before the restricted enforcement label was added to the networking namespace. The pod was already running when the policy was applied, so it was never checked against the new policy. Everything looked healthy.
The hard reset destroyed and recreated the cluster. When ArgoCD re-deployed the cloudflared manifest, the admission controller checked the pod spec against the restricted standard for the first time — and rejected it.
This is the time bomb pattern: a violation exists in your manifests but is invisible because the affected pods are already running. The bomb goes off when something forces pod recreation — a node restart, a cluster rebuild, a rollout triggered by any config change.
Timeline of a PodSecurity time bomb:
Month 1: Deploy cloudflared (no securityContext issues, namespace has no policy)
Month 2: Add pod-security.kubernetes.io/enforce: restricted to namespace
→ Running pods are NOT checked. No error. No warning.
Month 3: Everything looks fine. kubectl get pods shows Running.
Month 6: Cluster hard reset. Pods recreated.
→ BOOM. FailedCreate. Tunnel is down. You're locked out.
A policy that only validates on creation is a time bomb. If you can’t test it against running workloads, you have to test it proactively before you need it.
Proactive Validation with --dry-run=server
You can validate your manifests against PodSecurity policies without actually creating the pods:
# Validates against the REAL admission controller, including PodSecurity
kubectl apply --dry-run=server -f kubernetes/networking/base/cloudflared.yaml
The --dry-run=server flag sends the request through the full API server admission pipeline, including PodSecurity checks, but doesn’t persist the result. If the manifest would be rejected, you’ll see the exact same error message as a real apply — but nothing breaks.
This is fundamentally different from --dry-run=client, which only validates YAML syntax locally and tells you nothing about server-side admission.
# Client-side: only checks YAML syntax. Useless for PodSecurity.
kubectl apply --dry-run=client -f cloudflared.yaml # "unchanged" (LIES)
# Server-side: full admission check. Catches PodSecurity violations.
kubectl apply --dry-run=server -f cloudflared.yaml # Error! (TRUTH)
The Fix
Straightforward — add the required securityContext fields to both cloudflared.yaml and gitlab-agent.yaml in the networking namespace:
securityContext:
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
runAsNonRoot: true
runAsUser: 65532
capabilities:
drop: ["ALL"]
seccompProfile:
type: RuntimeDefault
Pushed to gitlab.com (which is external and always accessible, unlike the self-hosted GitLab inside the cluster). ArgoCD synced the change, the cloudflared pod was created successfully, and the tunnel came back up at T+45m.
Lesson learned: add kubectl apply --dry-run=server to CI for every manifest. Catch admission violations in CI, not during a 2 AM outage recovery.
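A sketch of what that CI step could look like. The `KUBECTL` override and the function name are assumptions for illustration; the override also makes the loop testable without a cluster:

```shell
# Hypothetical CI helper: server-side dry-run every manifest, fail on any reject.
KUBECTL="${KUBECTL:-kubectl}"

validate_manifests() {
  local dir="$1" failed=0 manifest
  while IFS= read -r -d '' manifest; do
    if ! "$KUBECTL" apply --dry-run=server -f "$manifest" >/dev/null 2>&1; then
      echo "ADMISSION REJECT: $manifest" >&2
      failed=1
    fi
  done < <(find "$dir" -name '*.yaml' -print0)
  return "$failed"
}
```

Run `validate_manifests kubernetes/` as a pipeline step; a non-zero exit fails the job before a bad manifest ever reaches the cluster.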
Part 3: Tracing Dependencies — Why 20 Pods Were Stuck Pending
The Problem at T+50m
With the cloudflared tunnel back up, SSH access was restored. A quick kubectl get pods --all-namespaces revealed the next surprise: over 20 pods stuck in Pending. Not CrashLoopBackOff, not Error, not FailedCreate — just Pending. They hadn’t even been scheduled to a node yet.
kubectl get pods -n apps
# NAME READY STATUS RESTARTS AGE
# jellyfin-6b8f7c9d4-abc12 0/1 Pending 0 10m
# audiobookshelf-5d4c8b7f2-def34 0/1 Pending 0 10m
# gitlab-ee-7a9e6d5c3-ghi56 0/1 Pending 0 10m
# ... (20+ more)
At this point I’m thinking: of course there’s another problem.
The Pod Dependency Chain
Before a pod can be scheduled and started, Kubernetes has to satisfy all of its dependencies:
Pod scheduling requires:
1. A node with enough CPU/memory (resource requests)
2. Node selectors / affinity rules satisfied
3. All PersistentVolumeClaims (PVCs) bound to PersistentVolumes (PVs)
4. All referenced Secrets exist
5. All referenced ConfigMaps exist
6. The ServiceAccount exists
7. No taints blocking the pod (unless tolerations match)
If any of these are unmet, the pod stays Pending. The scheduler won’t even attempt to place it.
Reading Scheduler Events
First debugging step for a Pending pod is always kubectl describe pod:
kubectl describe pod jellyfin-6b8f7c9d4-abc12 -n apps
Scroll to the Events section at the bottom:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 10m default-scheduler 0/1 nodes are available:
1 node(s) didn't find available persistent volumes to bind. preemption:
0/1 nodes are available: 1 Preemption is not helpful for scheduling.
Parsing it:
- `0/1 nodes are available` — There’s 1 node in the cluster. Zero of them can run this pod.
- `1 node(s) didn't find available persistent volumes to bind` — The single node failed because a PVC couldn’t bind to a PV. That’s the root cause.
- `preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling` — Even evicting other pods wouldn’t help. This confirms it’s not resource contention — it’s a missing dependency.
Scheduler events tell you exactly which dependency is unmet. “Didn’t find available persistent volumes” means the PVC/PV binding is broken. Don’t guess — read the event.
PV/PVC Binding First Principles
For a PVC to bind to a PV, all of the following must match:
| Field | PVC spec | PV spec | Must match? |
|---|---|---|---|
| StorageClass | storageClassName: local-storage | storageClassName: local-storage | Exact match |
| Access modes | accessModes: [ReadWriteOnce] | accessModes: [ReadWriteOnce] | PV must include PVC’s modes |
| Capacity | resources.requests.storage: 50Gi | capacity.storage: 50Gi | PV capacity >= PVC request |
| Volume mode | volumeMode: Filesystem | volumeMode: Filesystem | Exact match |
| Label selector | selector.matchLabels: {environment: prod} | labels: {environment: prod} | PV must have all labels PVC selects on |
That last one — the label selector — is the trap.
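The trap can be checked mechanically. A sketch comparing a PVC selector against PV labels with jq — the sample objects are hypothetical and trimmed; on a live cluster they'd come from the kubectl queries shown in the comments:

```shell
# Hypothetical trimmed specs — live equivalents:
#   kubectl get pvc appdata-pvc -n apps -o jsonpath='{.spec.selector.matchLabels}'
#   kubectl get pv appdata-pv -o jsonpath='{.metadata.labels}'
pvc_selector='{"environment":"prod"}'
pv_labels='{"type":"local"}'      # the failure mode: environment=prod is missing

# Emits true only if the PV carries every label the selector requires
jq -n --argjson sel "$pvc_selector" --argjson lab "$pv_labels" \
  '$sel | to_entries | all(.value == $lab[.key])'
# → false
```

Adding `environment: prod` to the PV's labels flips the result to true — which is exactly what the eventual `kubectl label pv` fix did.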
The Debugging Flow
Systematic approach for debugging a PVC that won’t bind:
# Step 1: Check PVC status
kubectl get pvc -n apps
# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
# appdata-pvc Pending local-storage 10m
# Step 2: Describe the PVC for events
kubectl describe pvc appdata-pvc -n apps
# Events:
# Warning ProvisioningFailed ... no persistent volumes available for this claim
# and no storage class is set
# Step 3: Check what PVs exist
kubectl get pv
# NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS AGE
# appdata-pv 50Gi RWO Retain Available local-storage 10m
# data-pv 900Gi RWO Retain Available local-storage 10m
At this point, everything looks like it should match. The PV exists, it’s Available (not already bound), the storage class matches, the capacity matches, the access modes match. So why won’t it bind?
The Invisible Label Selector
The PVC definition includes a label selector:
# In the app manifest (e.g., jellyfin.yaml)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: appdata-pvc
namespace: apps
spec:
storageClassName: local-storage
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 50Gi
selector:
matchLabels:
environment: prod
And the PV is supposed to have that label:
# In storage.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: appdata-pv
labels:
environment: prod # <-- This label is required for PVC binding
Here’s what happened: when hard-reset.sh calls apply_storage(), it runs storage.yaml through sed to substitute the node name for node affinity. The PVs were created correctly from the file — but during the broken initial recovery attempt at T+15m (before the full hard reset), the PVs had been manually deleted and recreated without the labels. The subsequent hard-reset.sh run’s apply_storage() recreated them again from the file, but the PVs from the botched manual recovery were still present and took precedence.
The fix:
# Add the missing label to the PVs
kubectl label pv appdata-pv environment=prod
kubectl label pv data-pv environment=prod
Within seconds, PVCs bound to PVs, and all 20+ pods transitioned from Pending to ContainerCreating to Running.
Why This Bug Is Insidious
This is one of the hardest Kubernetes bugs to spot. Here’s why:
- `kubectl get pv` doesn’t show labels. The default output columns are NAME, CAPACITY, ACCESS MODES, RECLAIM POLICY, STATUS, CLAIM, STORAGECLASS, and AGE. Labels aren’t visible. You have to explicitly ask: `kubectl get pv --show-labels` or `kubectl describe pv`.
- `kubectl get pvc` doesn’t show the selector. The default output shows STATUS, VOLUME, CAPACITY, ACCESS MODES, and STORAGECLASS. The label selector is invisible. You have to `kubectl describe pvc` or `kubectl get pvc -o yaml`.
- Everything else matches. Storage class, capacity, access modes, volume mode — it all looks correct. The only mismatch is a label selector that neither the PV listing nor the PVC listing shows by default.
- The error message doesn’t mention labels. The scheduler says “didn’t find available persistent volumes to bind.” It doesn’t say “PV exists but labels don’t match.” You have to figure out the label mismatch yourself.
When everything “looks right” but doesn’t work, check the fields that aren’t visible in the default output. Labels, annotations, selectors, and finalizers are the usual culprits. Use -o yaml or --show-labels to see the full picture. I should’ve done that five minutes earlier.
The Systematic Debugging Flow for Pending Pods
Use this every time you see a Pending pod:
1. kubectl describe pod <name> -n <namespace>
└── Read the Events section. What dependency is unmet?
2. If "persistent volumes":
a. kubectl get pvc -n <namespace>
└── Is the PVC Pending or Bound?
b. kubectl get pv --show-labels
└── Does a matching PV exist? Is it Available?
c. Compare PVC spec vs PV spec field by field:
- storageClassName
- accessModes
- capacity (PV >= PVC)
- selector.matchLabels vs PV labels ← THE TRAP
- volumeMode
3. If "insufficient cpu/memory":
└── Check node allocatable vs pod requests
4. If "node affinity/selector":
└── Check node labels vs pod nodeSelector/affinity
5. If "taints":
└── Check node taints vs pod tolerations
Part 4: Circular Dependencies — The Secrets Chicken-and-Egg Problem
The Problem at T+25m
When hard-reset.sh ran apply_secrets(), every sealed secret application failed:
Error from server (NotFound): namespaces "argocd" not found
Error from server (NotFound): namespaces "networking" not found
Error from server (NotFound): namespaces "apps" not found
[INFO] Secrets applied successfully
Read that last line again. Every single apply failed, and the script reported success. The find ... -exec kubectl apply -f {} \; command doesn’t propagate individual failures to the script’s exit code, and there’s no error checking around it. The script marched forward, oblivious, cheerfully announcing that zero secrets were “applied successfully.”
I love writing recovery scripts that lie to me during a recovery.
But there’s a deeper problem. Even if the ordering were fixed and namespaces existed, the sealed secrets themselves would still be useless.
How Sealed Secrets Work (and Why a New Cluster Breaks Them)
Sealed Secrets solves the “secrets in git” problem. You can’t store Kubernetes Secret manifests in git because they’re only base64-encoded, not encrypted. Sealed Secrets encrypts them with a public key so they can live safely in a repository.
The lifecycle:
1. Sealed Secrets controller generates an RSA key pair at install time
2. You encrypt secrets using the PUBLIC key (kubeseal --cert cert.pem)
3. The encrypted SealedSecret YAML goes into git
4. The controller in the cluster decrypts them using its PRIVATE key
5. Regular Kubernetes Secret objects are created from the decrypted data
The critical detail: the key pair is unique to each controller installation. When you destroy a cluster and create a new one, the new Sealed Secrets controller generates a new key pair. The old encrypted secrets — encrypted with the old public key — can’t be decrypted by the new private key. They’re cryptographic garbage.
Old cluster: PublicKey_A encrypts → SealedSecret → PrivateKey_A decrypts ✓
New cluster: PublicKey_A encrypts → SealedSecret → PrivateKey_B decrypts ✗ FAIL
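The key mismatch is ordinary RSA behavior, easy to demonstrate outside Kubernetes. A sketch with openssl standing in for the two controller generations (file names arbitrary, scratch directory for cleanliness):

```shell
cd "$(mktemp -d)"   # work in a scratch dir

# Two "controller installations" = two independent RSA key pairs
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out old.key 2>/dev/null
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out new.key 2>/dev/null
openssl pkey -in old.key -pubout -out old.pub

# "Seal" with the old cluster's public key
echo "db-password" | openssl pkeyutl -encrypt -pubin -inkey old.pub -out sealed.bin

# The old private key decrypts; the new cluster's key cannot
openssl pkeyutl -decrypt -inkey old.key -in sealed.bin          # → db-password
openssl pkeyutl -decrypt -inkey new.key -in sealed.bin 2>/dev/null \
  || echo "new cluster cannot unseal old secrets"
```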
After a cluster recreation, you must:
- Fetch the new certificate: `kubeseal --fetch-cert > cert.pem`
- Re-encrypt every secret with the new certificate
- Apply the new sealed secrets to the new cluster
The hard-reset.sh script handles step 1 (fetch_new_cert) and expects a manual step for step 2 (the wait_for_secrets function that prompts you to run a GitLab CI pipeline). But here’s where it gets fun…
The Circular Dependency Map
The re-sealing process depends on a GitLab CI pipeline. But GitLab runs inside the cluster.
To recover the cluster, you need:
→ Sealed secrets applied (so pods can access credentials)
→ Secrets re-sealed with new cert (old ones are garbage)
→ GitLab CI pipeline to re-seal secrets
→ GitLab running inside the cluster
→ The cluster being recovered ← CIRCULAR
This isn’t the only circular dependency. Here’s the full map:
| Dependency | What needs it | What provides it | Circular? |
|---|---|---|---|
| GitLab CI for re-sealing | Cluster recovery | GitLab pod in cluster | Yes |
| CF tunnel for SSH | Remote debugging | Cloudflared pod in cluster | Yes |
| ArgoCD for app deployment | App manifests | ArgoCD pod in cluster | Yes (partially — bootstrap is manual) |
| DNS for git pull | Script fetches from git | Cluster CoreDNS (only for in-cluster) | No (host DNS is independent) |
Three of these four are circular. The cluster needs things that run inside the cluster. If your recovery path depends on the thing you’re recovering, you don’t have a recovery path — you have a wish. Every critical dependency needs an out-of-cluster fallback.
Script Ordering Bugs and Silent Failures
The execution order in hard-reset.sh:
main() {
preflight_checks
create_directories
delete_cluster # Step 1: Destroy everything
create_cluster # Step 2: Fresh k3d cluster
apply_storage # Step 3: PVs and StorageClass
apply_secrets # Step 4: Sealed secrets ← FAILS (no namespaces yet!)
fetch_new_cert # Step 5: Get new cert
push_new_cert # Step 6: Push cert to git
wait_for_secrets # Step 7: Manual re-seal step
apply_apps # Step 8: ArgoCD + namespaces ← namespaces created HERE
wait_for_pods
setup_firewall
}
apply_secrets runs at step 4. Namespaces are created inside apply_apps at step 8, because apply_apps applies kubernetes/cluster/namespaces.yaml as part of the cluster resources. The secrets target namespaces (apps, networking, argocd) that don’t exist yet.
And apply_secrets() in cluster-lib.sh uses:
find "${REPO_ROOT}/kubernetes/secrets" -name "*-sealed-secret.yaml" -exec kubectl apply -f {} \;
The find -exec pattern doesn’t fail the overall command when individual kubectl apply calls fail. Each failed apply prints an error to stderr, but the exit code of find itself is 0 as long as the traversal succeeded. The script continues, prints “Secrets applied successfully,” and nobody notices that zero secrets were actually applied.
This is the silent failure pattern: error messages go to stderr, nothing checks for them, nothing counts them, and the script’s happy-path logging actively lies about the outcome. A script that crashes on error is annoying but honest. A script that swallows errors and prints “Success” is actively dangerous.
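The find -exec exit-code behavior is easy to verify in isolation (temp directory and file names are arbitrary):

```shell
# Every -exec command fails, yet find itself reports success
demo=$(mktemp -d)
touch "$demo/a-sealed-secret.yaml" "$demo/b-sealed-secret.yaml"
find "$demo" -name "*-sealed-secret.yaml" -exec false \;
echo "find exit code: $?"   # → find exit code: 0
```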
The DR Litmus Test
Here’s a test worth applying to every disaster recovery procedure:
Can you recover from zero without the thing you’re recovering?
Walk through your recovery script mentally and, for each step, ask: “Does this step require something that only exists inside the cluster?” If yes, that step will fail during a real DR scenario.
For hard-reset.sh:
| Step | Requires in-cluster component? | Fails during DR? |
|---|---|---|
| `delete_cluster` | No (k3d CLI) | No |
| `create_cluster` | No (k3d CLI) | No |
| `apply_storage` | No (kubectl + local files) | No |
| `apply_secrets` | Namespaces (not yet created) | Yes |
| `fetch_new_cert` | Sealed Secrets controller | No (installed by apply_secrets) |
| `wait_for_secrets` | GitLab CI pipeline | Yes |
| `apply_apps` | No (helm + kubectl + local files) | No |
Two steps fail. One’s an ordering bug (fixable). The other is a fundamental circular dependency (requires architectural change).
Breaking Circular Dependencies
Four strategies for breaking circular dependencies in infrastructure:
1. External bootstrap store. Keep secret values outside the cluster in a location that’s always accessible. This cluster uses gitlab.com CI/CD variables (external SaaS, not the self-hosted GitLab). During DR, glab variable get can retrieve every secret value without needing the cluster.
2. Self-contained scripts. The recovery script should embed or locally cache everything it needs. Instead of depending on a CI pipeline to re-seal secrets, hard-reset.sh should include a seal-secrets.sh that reads values from environment variables or a local file, seals them with the new cert, and applies them — all without network dependencies beyond the local cluster.
3. Fail loud, not silent. If a step fails, the script must stop and tell you. The || true pattern and unchecked find -exec swallow errors. Replace them with explicit error tracking:
# Bad: silent failure
find "${REPO_ROOT}/kubernetes/secrets" -name "*-sealed-secret.yaml" -exec kubectl apply -f {} \;
log_info "Secrets applied successfully"
# Good: fail loud
failures=0   # plain assignment — 'local' is only valid inside a function
while IFS= read -r -d '' secret_file; do
  if ! kubectl apply -f "$secret_file"; then
    log_error "Failed to apply: $secret_file"
    failures=$((failures + 1))  # note: ((failures++)) would trip set -e on the first increment
  fi
done < <(find "${REPO_ROOT}/kubernetes/secrets" -name "*-sealed-secret.yaml" -print0)
if [[ $failures -gt 0 ]]; then
  log_error "$failures secret(s) failed to apply"
  exit 1
fi
4. Test the DR path. Run hard-reset.sh (or dr-test.sh) regularly in a non-production context. The cluster already has a dr-test.sh script that creates a separate k3d cluster for testing — use it. A DR procedure that’s never been tested isn’t a DR procedure; it’s a hope.
Circular dependencies are invisible during normal operations. They only surface during recovery — the exact moment you can least afford surprises. Map them, break them, and test the breaks.
Part 5: Action Items and What I Learned
Lessons Learned
This incident exposed five patterns through direct, painful experience. Each one will repeat across different systems and contexts — which is exactly why I’m writing them down.
1. Your access path is part of your blast radius.
The Cloudflare tunnel runs as a pod inside the cluster. Any operation that disrupts the cluster also disrupts my ability to observe, debug, and recover it. This isn’t unique to tunnels — it applies to monitoring (if Prometheus is down, you can’t see that things are down), logging (if Loki is down, you can’t see why), and CI/CD (if ArgoCD is down, you can’t deploy fixes). Before touching infrastructure, trace your access path and verify it doesn’t pass through the thing you’re touching. I didn’t, and that’s how I ended up walking to my server room.
2. Admission policies are time bombs when applied to running workloads.
PodSecurity Standards only validate at pod creation time. A pod created before the policy was applied will run indefinitely without being checked. The violation only becomes visible when the pod is recreated — during a rollout, a node drain, or a cluster rebuild. Months of green dashboards hiding manifests that will fail on the next restart. --dry-run=server or CI linting is the only defense.
3. The fields you can’t see in default output are the ones that bite you.
PV labels, PVC selectors, finalizers, annotations, owner references — none of these appear in kubectl get output by default. When debugging a binding or scheduling failure where “everything looks right,” the problem is almost always in a field you’re not looking at. Train yourself to reach for -o yaml, --show-labels, and kubectl describe before concluding that something is broken at a deeper level.
4. Silent failures in scripts are worse than crashes.
hard-reset.sh applied zero secrets, logged “Secrets applied successfully,” and continued to the next step. During an outage, when you’re stressed and moving fast, you will trust the script’s output. If that output lies, you’ll waste time debugging the wrong thing. set -euo pipefail is the starting point, not the finish line.
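That last point is easy to demonstrate: even with the strict flags on, an `|| true` still launders a failure into success (a minimal sketch):

```shell
# strict mode doesn't see through '|| true'
bash -c 'set -euo pipefail
  false || true            # the failure is swallowed...
  echo "reached: exit=$?"  # ...and the script carries on
'
# → reached: exit=0
```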
5. Circular dependencies are invisible until recovery time.
During normal operations, GitLab CI runs inside the cluster and re-seals secrets on demand. The tunnel provides SSH access. ArgoCD syncs manifests. Everything works because everything’s already running. The circular dependencies only surface when you need to bootstrap from zero. Test your recovery procedure from zero, not from a half-working state.
Action Items
P0 — Do Immediately
| # | Action | Why |
|---|---|---|
| 1 | Add --registry-config to k3d cluster creation in cluster-lib.sh | Eliminates the need to ever restart a running node for registry config. The trigger for this entire incident disappears. |
| 2 | Fix hard-reset.sh ordering: create namespaces before applying secrets | Move kubectl apply -f kubernetes/cluster/namespaces.yaml before apply_secrets(), or have apply_secrets() create target namespaces if they don’t exist. |
| 3 | Fix hard-reset.sh error handling: fail loudly on secret apply errors | Replace find -exec kubectl apply with a loop that counts and reports failures. Remove || true from any kubectl apply call that matters. |
| 4 | Add PodSecurity validation to CI or pre-commit | Run kubectl apply --dry-run=server or use a policy linter (e.g., kyverno CLI, kubeconform with policy plugins) to validate all manifests against their target namespace’s PodSecurity level before merge. |
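Item 3's loop might look like the following. A sketch, assuming bash; the function name mirrors the script's apply_secrets(), but the body and messages are mine. It fails loudly on both error modes from the incident: individual apply failures and the "applied zero secrets, reported success" case.

```shell
# Apply each sealed secret individually; count and report failures
# instead of logging success unconditionally.
set -euo pipefail

apply_secrets() {
  local dir="$1" applied=0 failed=0
  while IFS= read -r -d '' f; do
    if kubectl apply -f "$f" >/dev/null; then
      applied=$((applied + 1))
    else
      echo "ERROR: failed to apply $f" >&2
      failed=$((failed + 1))
    fi
  done < <(find "$dir" -name '*.yaml' -print0)
  echo "Applied $applied secret(s), $failed failure(s)"
  [ "$failed" -eq 0 ] || return 1
  # Zero secrets applied is also a failure, not a success.
  [ "$applied" -gt 0 ] || { echo "ERROR: no secrets found in $dir" >&2; return 1; }
}
```

With set -euo pipefail at the top of hard-reset.sh, a nonzero return here halts the whole recovery run instead of letting it continue on a lie.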
P1 — Do This Week
| # | Action | Why |
|---|---|---|
| 5 | Make seal-secrets.sh self-contained for DR | Add an inline sealing step to hard-reset.sh that reads secret values from environment variables or a local file, seals them with the new cert, and applies them. Remove the dependency on GitLab CI for re-sealing during recovery. |
| 6 | Document “never restart the k3d node” in knowledge/k3d-cluster.md | Explicitly state that docker restart k3d-homelab-server-0 is forbidden. Document the safe alternatives: recreate with hard-reset.sh or pass config at creation time. |
| 7 | Add PV label verification to apply_storage() in cluster-lib.sh | After applying storage.yaml, verify that all PVs have the expected environment: prod label. Alert if any are missing. |
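Item 5's inline sealing step might be sketched as below, assuming bash and the kubeseal CLI. The secret name, namespace, key, and environment variable are placeholders of mine, not the repo's actual values.

```shell
# Seal one secret inline during recovery, using the freshly generated
# controller cert instead of round-tripping through GitLab CI.
set -euo pipefail

reseal_secret() {
  local name="$1" ns="$2" key="$3" value="$4" cert="$5"
  kubectl create secret generic "$name" \
    --namespace "$ns" \
    --from-literal="$key=$value" \
    --dry-run=client -o yaml \
    | kubeseal --cert "$cert" --format yaml \
    | kubectl apply -f -
}

# Usage during recovery (fetch the new controller's cert first):
#   kubeseal --fetch-cert > /tmp/sealed-secrets-cert.pem
#   reseal_secret cloudflared-token cloudflared token "$CF_TUNNEL_TOKEN" \
#     /tmp/sealed-secrets-cert.pem
```

This only works if the target namespace already exists, which is exactly why item 2's ordering fix has to land first.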
P2 — Do When Convenient
| # | Action | Why |
|---|---|---|
| 8 | Set up a secondary access path (Tailscale or host-level SSH) | A lightweight VPN or SSH daemon running on the host (not in k3d) provides backup access when the tunnel is down. Eliminates the single point of failure for remote access. |
| 9 | Run dr-test.sh monthly | The DR test script creates a separate k3d cluster and validates the full recovery path. Running it regularly catches ordering bugs, circular dependencies, and manifest drift before they matter. |
| 10 | Add a pre-flight blast radius check to dangerous scripts | Before any destructive operation, print what will be affected and require explicit confirmation. “This will restart k3d-homelab-server-0. This will terminate ALL pods including the CF tunnel. Remote access will be lost. Continue? (yes/no)” |
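Item 10's gate could be as small as this. A hedged sketch, assuming bash; the warning text mirrors the action item, and the function name is mine.

```shell
# Print the blast radius and require a literal "yes" before proceeding.
confirm_blast_radius() {
  cat >&2 <<'EOF'
This will restart k3d-homelab-server-0.
This will terminate ALL pods including the CF tunnel.
Remote access will be lost.
EOF
  read -r -p "Continue? (yes/no) " answer
  [ "$answer" = "yes" ]
}

# Usage at the top of a dangerous script:
#   confirm_blast_radius || { echo "Aborted." >&2; exit 1; }
```

Requiring the full word "yes" (not just y, and not Enter) is deliberate: it forces a moment of reading before the irreversible step.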
Before Touching Infrastructure: The Checklist
┌─────────────────────────────────────────────────────────────┐
│ BEFORE TOUCHING INFRASTRUCTURE │
│ │
│ □ BLAST RADIUS │
│ What is the worst case if this goes wrong? │
│ Does my access path go through the thing I'm changing? │
│ │
│ □ ROLLBACK PLAN │
│ Can I undo this? How? How long will it take? │
│ What state will the system be in if I need to roll back? │
│ │
│ □ ALTERNATE ACCESS │
│ If this breaks my primary access, how do I get in? │
│ Is physical access available if needed? │
│ │
│ □ NON-DESTRUCTIVE TEST │
│ Can I validate this with --dry-run=server? │
│ Can I inspect the current state without changing it? │
│ Can I test this on a non-production cluster first? │
│ │
│ □ MINIMAL CHANGE │
│ Is there a less disruptive way to achieve this? │
│ Am I changing one thing, or am I changing many? │
│ Can I scope this to one namespace/pod/node? │
│ │
│ If any box is unchecked, STOP. Fill it in first. │
└─────────────────────────────────────────────────────────────┘
The specific technologies will change — k3d, Cloudflare, Sealed Secrets — but the patterns won’t: blast radius analysis, reading error messages precisely, tracing dependency chains, and breaking circular dependencies. These are the skills that turn a 75-minute outage into a 15-minute one next time. Assuming I actually use the checklist.