I ran docker restart k3d-homelab-server-0 and my SSH session froze. Then it disconnected. Then I realized the SSH tunnel runs inside the cluster I just restarted.
That was the beginning of a 75-minute full outage that taught me more about my own infrastructure than the previous six months of it working fine.
| | |
|---|---|
| Date | 2026-03-08 |
| Duration | ~75 minutes |
| Severity | Full outage — all services down, no remote access |
| Trigger | docker restart k3d-homelab-server-0 to pick up containerd registry config |
Each part covers a different failure mode and the debugging methodology behind it. The specific technologies will change; the patterns won’t.
Timeline (Quick Reference)
| Time | Event |
|---|---|
| T+0 | Ran docker restart k3d-homelab-server-0 to reload registries.yaml |
| T+0 | SSH connection drops — CF tunnel runs inside the cluster |
| T+5m | Repeated SSH attempts fail with websocket: bad handshake |
| T+10m | Confirmed from server: cluster is up, but tunnel pod is down |
| T+15m | Attempted basic pod recovery — cluster too broken |
| T+20m | Ran hard-reset.sh — full cluster destroy + recreate |
| T+25m | Hard reset completes, but sealed secrets fail (namespaces don’t exist yet) |
| T+30m | Skipped manual re-seal step, applied apps via ArgoCD |
| T+35m | Cloudflared pod stuck in FailedCreate — PodSecurity violation |
| T+40m | Identified missing securityContext fields in cloudflared + gitlab-agent manifests |
| T+45m | Pushed fix to gitlab.com, ArgoCD syncs, tunnel comes back up |
| T+50m | Discovered 20+ pods stuck Pending — PVCs not binding to PVs |
| T+55m | Found PVs missing environment:prod labels that PVC selectors require |
| T+60m | Labeled PVs, all pods start scheduling |
| T+75m | Full recovery — all services operational |
Part 1: The Incident — What Happened and How We Knew
The Setup
All I wanted to do was add an insecure container registry (k3d-registry.localhost:5050) to the k3d cluster’s containerd config. Edit /etc/rancher/k3s/registries.yaml inside the k3d node container, restart k3s so containerd picks up the change. Simple.
The seemingly obvious way to do this:
docker restart k3d-homelab-server-0
Here is what actually happened.
T+0: The Command
That single command killed every running process inside the k3d container. In a k3d cluster, the Docker container is the node. Restarting it is equivalent to pulling the power cord on a bare-metal server. Every pod dies simultaneously — no graceful shutdown, no drain, no eviction. The kubelet, the API server, etcd, CoreDNS, every application pod. All gone at once.
T+0: The Moment I Knew
The SSH session froze. Then it disconnected. And that’s when it hit me — my SSH connection doesn’t go through a normal network path. It goes through a Cloudflare Tunnel, implemented by a cloudflared pod running in the networking namespace. Inside the very cluster I just restarted.
┌──────────────────────────────────┐
laptop ──SSH──► CF Edge ──tunnel──► cloudflared pod ──► sshd
│ (networking namespace) │
│ INSIDE k3d cluster │
└──────────────────────────────────┘
The access path to debug the cluster runs through the cluster. When the cluster’s down, so is my ability to fix it. I can’t SSH in to fix the thing that lets me SSH in.
This is the blast radius problem — and I’d just learned it the hard way. The command’s effect extended beyond the cluster workloads to include my ability to observe and fix the cluster.
T+5m: Retries and Confirmation
Repeated ssh ms attempts returned websocket: bad handshake. That error comes from the Cloudflare edge — the tunnel endpoint is unreachable, so the WebSocket upgrade that carries the SSH session can’t complete. Each retry confirmed the same thing: the tunnel was down, and it wasn’t coming back on its own.
T+10m: Physical Access
Since remote access was gone, recovery meant walking over to the server. From the local terminal:
kubectl get pods -n networking
# cloudflared-tunnel-xxxxx 0/1 CrashLoopBackOff 3 4m
The cluster had come back up (the k3d container restart did restart k3s), but the cloudflared pod was crash-looping. The node restart had corrupted enough cluster state that pods weren’t recovering cleanly.
T+15m through T+20m: Triage and Decision
Quick triage showed multiple problems: pods in CrashLoopBackOff, PVCs in Pending, secrets missing. The cluster was in a half-alive state that would take longer to untangle than to rebuild. Decision: run hard-reset.sh and start fresh.
This is one of those judgment calls that gets easier with experience — debugging a half-broken cluster can take hours. A full rebuild from a known-good script takes 20 minutes. When the blast radius is “everything,” the fastest path to recovery is often a clean rebuild, not surgical repair.
The Blast Radius Concept
Blast radius is the total set of things that break when something goes wrong. Most operators think about the direct effect (“this restarts the node”) but miss the transitive effects (“the node hosts the tunnel that provides my access to the node”). I certainly did.
To map blast radius, ask three questions:
- What runs on this thing? For k3d-homelab-server-0: everything. Every pod, every service, the entire control plane.
- What depends on those things? Every application, every ingress route, DNS resolution, the tunnel, monitoring, backups — all of it.
- Does my access path depend on any of those things? Yes. SSH goes through the Cloudflare tunnel pod. If the tunnel dies, I’m locked out remotely.
If the answer to question 3 is “yes,” you either (a) have an alternate access path ready, (b) are physically present at the machine, or (c) don’t run the command. I should’ve asked myself that before pressing Enter.
The Pre-Flight Checklist
Every infrastructure change should pass this checklist before execution. It takes 60 seconds and prevents hours-long outages:
- Blast radius: What’s the worst case if this goes wrong? Write it down.
- Rollback plan: Can I undo this? How long will it take? What state will I be in?
- Alternate access: If this breaks my primary access path, how do I get in?
- Non-destructive test: Can I validate this change without applying it? (`--dry-run=server`, `docker exec` to inspect config, etc.)
- Minimal change: Is there a less disruptive way to achieve the same goal? (For registry config: recreate the cluster with `--registry-config` at creation time instead of restarting a running node.)
The blast radius of a command includes your ability to observe its effects. If it can take down your monitoring, your access path, or your ability to roll back — you’re flying blind the moment you press Enter.
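The "alternate access" check is the one I skipped. A minimal sketch of how it could be scripted — the function name and environment variable are hypothetical, not part of this cluster's tooling:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight guard: refuse a destructive command until the
# operator explicitly confirms an alternate access path exists.
confirm_blast_radius() {
  local cmd="$1"
  if [[ "${ALT_ACCESS_CONFIRMED:-no}" != "yes" ]]; then
    echo "REFUSED: '$cmd' — set ALT_ACCESS_CONFIRMED=yes after verifying a backup access path" >&2
    return 1
  fi
  echo "OK: running '$cmd'"
  # a real wrapper would execute the command here
}
```

Wrapping `docker restart k3d-homelab-server-0` in a guard like this forces the 60-second blast-radius conversation before the outage instead of after.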
What the Right Approach Would Have Been
The safe way to add a container registry to a k3d cluster is to pass it at cluster creation time:
k3d cluster create homelab \
--registry-use k3d-registry.localhost:5050 \
--registry-config registries.yaml \
...
If you need to change the registry config on a running cluster, the correct approach is:
- Verify you have physical access or a backup access path
- Inspect current state: `docker exec -it k3d-homelab-server-0 cat /etc/rancher/k3s/registries.yaml`
- Make the change
- Understand that restarting the node will cause a full outage
- Plan for the outage window accordingly
Or better yet: destroy and recreate the cluster with the correct config. In a single-node k3d homelab, this is actually less risky than a node restart because hard-reset.sh follows a tested, ordered sequence rather than hoping everything comes back cleanly after a cold restart.
Part 2: Reading Error Messages Like an SRE — PodSecurity Violations
The Problem at T+35m
After hard-reset.sh recreated the cluster and ArgoCD started syncing applications, the cloudflared tunnel pod was stuck. Instead of Running, it showed FailedCreate:
kubectl get pods -n networking
# NAME READY STATUS RESTARTS AGE
# cloudflared-tunnel-7f8b4d6c9-x2k4j 0/1 FailedCreate 0 2m
Most people would start googling “FailedCreate kubernetes” at this point. That’s backwards. The error message itself tells you exactly what happened — if you know how to read it.
The kubectl Debugging Ladder
Here’s something that took me too long to internalize: information about a failure lives one level up from where you see the symptom. The pod shows FailedCreate, but the pod doesn’t know why it failed — the thing that tried to create it does.
The ownership chain in Kubernetes:
Deployment → ReplicaSet → Pod
When a pod fails to create, the ReplicaSet is the object that attempted the creation and received the error. So the debugging ladder is:
# Step 1: See the symptom (pod level)
kubectl get pods -n networking
# Step 2: Look one level up (replicaset level) — THIS is where the error lives
kubectl describe replicaset -n networking -l app=cloudflared-tunnel
# Step 3: If needed, look two levels up (deployment level)
kubectl describe deployment cloudflared-tunnel -n networking
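The ladder can also be walked mechanically by reading the pod's ownerReferences. A sketch using jq on a trimmed, hypothetical pod object — on a live cluster the JSON would come from `kubectl get pod <name> -o json`:

```shell
# Hypothetical, trimmed pod object — live equivalent:
#   pod_json=$(kubectl get pod cloudflared-tunnel-7f8b4d6c9-x2k4j -n networking -o json)
pod_json='{"metadata":{"name":"cloudflared-tunnel-7f8b4d6c9-x2k4j",
  "ownerReferences":[{"kind":"ReplicaSet","name":"cloudflared-tunnel-7f8b4d6c9"}]}}'

# Climb one level up: the object that attempted (and failed) the creation
owner_kind=$(echo "$pod_json" | jq -r '.metadata.ownerReferences[0].kind')
owner_name=$(echo "$pod_json" | jq -r '.metadata.ownerReferences[0].name')
echo "next rung: kubectl describe $owner_kind $owner_name -n networking"
```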
Running kubectl describe on the ReplicaSet revealed the actual error in its Events section:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 2m replicaset-controller Error creating: pods "cloudflared-tunnel-7f8b4d6c9-x2k4j"
is forbidden: violates PodSecurity "restricted:latest":
allowPrivilegeEscalation != false
(container "cloudflared" must set securityContext.allowPrivilegeEscalation=false),
unrestricted capabilities
(container "cloudflared" must set securityContext.capabilities.drop=["ALL"]),
runAsNonRoot != true
(pod or container "cloudflared" must set securityContext.runAsNonRoot=true),
seccompProfile
(pod or container "cloudflared" must set securityContext.seccompProfile.type
to "RuntimeDefault" or "Localhost")
Kubernetes errors are precise. They tell you exactly what’s wrong and exactly what to fix. Don’t scan for keywords — parse every clause.
Parsing the Error Message
Breaking it down piece by piece:
- `is forbidden` — The API server rejected the pod creation. This isn’t a runtime failure; the pod was never created.
- `violates PodSecurity "restricted:latest"` — The networking namespace has a PodSecurity Standard set to restricted at the enforce level. The pod spec doesn’t meet this standard.
- `allowPrivilegeEscalation != false` — The container must explicitly set `securityContext.allowPrivilegeEscalation: false`.
- `unrestricted capabilities` — The container must drop all Linux capabilities with `capabilities.drop: ["ALL"]`.
- `runAsNonRoot != true` — The pod or container must set `runAsNonRoot: true`.
- `seccompProfile` — The pod or container must set a seccomp profile of type `RuntimeDefault` or `Localhost`.
The error message is literally a checklist of what to add to the manifest. Each line is a missing field. Kubernetes is being helpful here — you just have to read it.
PodSecurity Standards 101
Kubernetes has three PodSecurity Standards, from most to least permissive:
| Standard | What it allows | Use case |
|---|---|---|
| privileged | Everything. No restrictions. | System-level infrastructure (CNI, storage drivers) |
| baseline | Blocks known privilege escalations. Allows most workloads. | General applications |
| restricted | Hardened. Requires explicit security settings. | Security-sensitive namespaces |
Each standard can be applied at three enforcement levels:
| Level | Behavior |
|---|---|
| enforce | Reject pods that violate the standard. Pod is never created. |
| warn | Allow creation, but add a warning to the API response. |
| audit | Allow creation silently, but log the violation. |
The networking namespace in this cluster has:
metadata:
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/warn: restricted
This means: reject any pod that doesn’t meet the restricted standard. No exceptions, no grace period, no warnings-only mode. If your securityContext isn’t perfect, the pod doesn’t get created.
The Time Bomb Pattern
Here’s the insidious part: PodSecurity admission checks only happen at pod creation time, not on running pods.
Before the hard reset, the cloudflared pod was running fine. It had been deployed months ago, before the restricted enforcement label was added to the networking namespace. The pod was already running when the policy was applied, so it was never checked against the new policy. Everything looked healthy.
The hard reset destroyed and recreated the cluster. When ArgoCD re-deployed the cloudflared manifest, the admission controller checked the pod spec against the restricted standard for the first time — and rejected it.
This is the time bomb pattern: a violation exists in your manifests but is invisible because the affected pods are already running. The bomb goes off when something forces pod recreation — a node restart, a cluster rebuild, a rollout triggered by any config change.
Timeline of a PodSecurity time bomb:
Month 1: Deploy cloudflared (no securityContext issues, namespace has no policy)
Month 2: Add pod-security.kubernetes.io/enforce: restricted to namespace
→ Running pods are NOT checked. No error. No warning.
Month 3: Everything looks fine. kubectl get pods shows Running.
Month 6: Cluster hard reset. Pods recreated.
→ BOOM. FailedCreate. Tunnel is down. You're locked out.
A policy that only validates on creation is a time bomb. If you can’t test it against running workloads, you have to test it proactively before you need it.
Proactive Validation with --dry-run=server
You can validate your manifests against PodSecurity policies without actually creating the pods:
# Validates against the REAL admission controller, including PodSecurity
kubectl apply --dry-run=server -f kubernetes/networking/base/cloudflared.yaml
The --dry-run=server flag sends the request through the full API server admission pipeline, including PodSecurity checks, but doesn’t persist the result. If the manifest would be rejected, you’ll see the exact same error message as a real apply — but nothing breaks.
This is fundamentally different from --dry-run=client, which only validates YAML syntax locally and tells you nothing about server-side admission.
# Client-side: only checks YAML syntax. Useless for PodSecurity.
kubectl apply --dry-run=client -f cloudflared.yaml # "unchanged" (LIES)
# Server-side: full admission check. Catches PodSecurity violations.
kubectl apply --dry-run=server -f cloudflared.yaml # Error! (TRUTH)
The Fix
Straightforward — add the required securityContext fields to both cloudflared.yaml and gitlab-agent.yaml in the networking namespace:
securityContext:
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
runAsNonRoot: true
runAsUser: 65532
capabilities:
drop: ["ALL"]
seccompProfile:
type: RuntimeDefault
Pushed to gitlab.com (which is external and always accessible, unlike the self-hosted GitLab inside the cluster). ArgoCD synced the change, the cloudflared pod was created successfully, and the tunnel came back up at T+45m.
Lesson learned: add kubectl apply --dry-run=server to CI for every manifest. Catch admission violations in CI, not during a 2 AM outage recovery.
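A sketch of what that CI step could look like. The `KUBECTL` override and the function name are assumptions for illustration; the override also makes the loop testable without a cluster:

```shell
# Hypothetical CI helper: server-side dry-run every manifest, fail on any reject.
KUBECTL="${KUBECTL:-kubectl}"

validate_manifests() {
  local dir="$1" failed=0 manifest
  while IFS= read -r -d '' manifest; do
    if ! "$KUBECTL" apply --dry-run=server -f "$manifest" >/dev/null 2>&1; then
      echo "ADMISSION REJECT: $manifest" >&2
      failed=1
    fi
  done < <(find "$dir" -name '*.yaml' -print0)
  return "$failed"
}
```

Run `validate_manifests kubernetes/` as a pipeline step; a non-zero exit fails the job before a bad manifest ever reaches the cluster.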
Part 3: Tracing Dependencies — Why 20 Pods Were Stuck Pending
The Problem at T+50m
With the cloudflared tunnel back up, SSH access was restored. A quick kubectl get pods --all-namespaces revealed the next surprise: over 20 pods stuck in Pending. Not CrashLoopBackOff, not Error, not FailedCreate — just Pending. They hadn’t even been scheduled to a node yet.
kubectl get pods -n apps
# NAME READY STATUS RESTARTS AGE
# jellyfin-6b8f7c9d4-abc12 0/1 Pending 0 10m
# audiobookshelf-5d4c8b7f2-def34 0/1 Pending 0 10m
# gitlab-ee-7a9e6d5c3-ghi56 0/1 Pending 0 10m
# ... (20+ more)
At this point I’m thinking: of course there’s another problem.
The Pod Dependency Chain
Before a pod can be scheduled and started, Kubernetes has to satisfy all of its dependencies:
Pod scheduling requires:
1. A node with enough CPU/memory (resource requests)
2. Node selectors / affinity rules satisfied
3. All PersistentVolumeClaims (PVCs) bound to PersistentVolumes (PVs)
4. All referenced Secrets exist
5. All referenced ConfigMaps exist
6. The ServiceAccount exists
7. No taints blocking the pod (unless tolerations match)
If any of these are unmet, the pod stays Pending. The scheduler won’t even attempt to place it.
Reading Scheduler Events
First debugging step for a Pending pod is always kubectl describe pod:
kubectl describe pod jellyfin-6b8f7c9d4-abc12 -n apps
Scroll to the Events section at the bottom:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 10m default-scheduler 0/1 nodes are available:
1 node(s) didn't find available persistent volumes to bind. preemption:
0/1 nodes are available: 1 Preemption is not helpful for scheduling.
Parsing it:
- `0/1 nodes are available` — There’s 1 node in the cluster. Zero of them can run this pod.
- `1 node(s) didn't find available persistent volumes to bind` — The single node failed because a PVC couldn’t bind to a PV. That’s the root cause.
- `preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling` — Even evicting other pods wouldn’t help. This confirms it’s not resource contention — it’s a missing dependency.
Scheduler events tell you exactly which dependency is unmet. “Didn’t find available persistent volumes” means the PVC/PV binding is broken. Don’t guess — read the event.
PV/PVC Binding First Principles
For a PVC to bind to a PV, all of the following must match:
| Field | PVC spec | PV spec | Must match? |
|---|---|---|---|
| StorageClass | storageClassName: local-storage | storageClassName: local-storage | Exact match |
| Access modes | accessModes: [ReadWriteOnce] | accessModes: [ReadWriteOnce] | PV must include PVC’s modes |
| Capacity | resources.requests.storage: 50Gi | capacity.storage: 50Gi | PV capacity >= PVC request |
| Volume mode | volumeMode: Filesystem | volumeMode: Filesystem | Exact match |
| Label selector | selector.matchLabels: {environment: prod} | labels: {environment: prod} | PV must have all labels PVC selects on |
That last one — the label selector — is the trap.
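The trap can be checked mechanically. A sketch comparing a PVC selector against PV labels with jq — the sample objects are hypothetical and trimmed; on a live cluster they'd come from the kubectl queries shown in the comments:

```shell
# Hypothetical trimmed specs — live equivalents:
#   kubectl get pvc appdata-pvc -n apps -o jsonpath='{.spec.selector.matchLabels}'
#   kubectl get pv appdata-pv -o jsonpath='{.metadata.labels}'
pvc_selector='{"environment":"prod"}'
pv_labels='{"type":"local"}'      # the failure mode: environment=prod is missing

# Emits true only if the PV carries every label the selector requires
jq -n --argjson sel "$pvc_selector" --argjson lab "$pv_labels" \
  '$sel | to_entries | all(.value == $lab[.key])'
# → false
```

Adding `environment: prod` to the PV's labels flips the result to true — which is exactly what the eventual `kubectl label pv` fix did.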
The Debugging Flow
Systematic approach for debugging a PVC that won’t bind:
# Step 1: Check PVC status
kubectl get pvc -n apps
# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
# appdata-pvc Pending local-storage 10m
# Step 2: Describe the PVC for events
kubectl describe pvc appdata-pvc -n apps
# Events:
# Warning ProvisioningFailed ... no persistent volumes available for this claim
# and no storage class is set
# Step 3: Check what PVs exist
kubectl get pv
# NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS AGE
# appdata-pv 50Gi RWO Retain Available local-storage 10m
# data-pv 900Gi RWO Retain Available local-storage 10m
At this point, everything looks like it should match. The PV exists, it’s Available (not already bound), the storage class matches, the capacity matches, the access modes match. So why won’t it bind?
The Invisible Label Selector
The PVC definition includes a label selector:
# In the app manifest (e.g., jellyfin.yaml)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: appdata-pvc
namespace: apps
spec:
storageClassName: local-storage
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 50Gi
selector:
matchLabels:
environment: prod
And the PV is supposed to have that label:
# In storage.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: appdata-pv
labels:
environment: prod # <-- This label is required for PVC binding
Here’s what happened: when hard-reset.sh calls apply_storage(), it runs storage.yaml through sed to substitute the node name for node affinity. The PVs were created correctly from the file — but during the broken initial recovery attempt at T+15m (before the full hard reset), the PVs had been manually deleted and recreated without the labels. The subsequent hard-reset.sh run’s apply_storage() recreated them again from the file, but the PVs from the botched manual recovery were still present and took precedence.
The fix:
# Add the missing label to the PVs
kubectl label pv appdata-pv environment=prod
kubectl label pv data-pv environment=prod
Within seconds, PVCs bound to PVs, and all 20+ pods transitioned from Pending to ContainerCreating to Running.
Why This Bug Is Insidious
This is one of the hardest Kubernetes bugs to spot. Here’s why:
- `kubectl get pv` doesn’t show labels. The default output columns are NAME, CAPACITY, ACCESS MODES, RECLAIM POLICY, STATUS, CLAIM, STORAGECLASS, and AGE. Labels aren’t visible. You have to explicitly ask: `kubectl get pv --show-labels` or `kubectl describe pv`.
- `kubectl get pvc` doesn’t show the selector. The default output shows STATUS, VOLUME, CAPACITY, ACCESS MODES, and STORAGECLASS. The label selector is invisible. You have to `kubectl describe pvc` or `kubectl get pvc -o yaml`.
- Everything else matches. Storage class, capacity, access modes, volume mode — it all looks correct. The only mismatch is a label selector that neither the PV listing nor the PVC listing shows by default.
- The error message doesn’t mention labels. The scheduler says “didn’t find available persistent volumes to bind.” It doesn’t say “PV exists but labels don’t match.” You have to figure out the label mismatch yourself.
When everything “looks right” but doesn’t work, check the fields that aren’t visible in the default output. Labels, annotations, selectors, and finalizers are the usual culprits. Use -o yaml or --show-labels to see the full picture. I should’ve done that five minutes earlier.
The Systematic Debugging Flow for Pending Pods
Use this every time you see a Pending pod:
1. kubectl describe pod <name> -n <namespace>
└── Read the Events section. What dependency is unmet?
2. If "persistent volumes":
a. kubectl get pvc -n <namespace>
└── Is the PVC Pending or Bound?
b. kubectl get pv --show-labels
└── Does a matching PV exist? Is it Available?
c. Compare PVC spec vs PV spec field by field:
- storageClassName
- accessModes
- capacity (PV >= PVC)
- selector.matchLabels vs PV labels ← THE TRAP
- volumeMode
3. If "insufficient cpu/memory":
└── Check node allocatable vs pod requests
4. If "node affinity/selector":
└── Check node labels vs pod nodeSelector/affinity
5. If "taints":
└── Check node taints vs pod tolerations
Part 4: Circular Dependencies — The Secrets Chicken-and-Egg Problem
The Problem at T+25m
When hard-reset.sh ran apply_secrets(), every sealed secret application failed:
Error from server (NotFound): namespaces "argocd" not found
Error from server (NotFound): namespaces "networking" not found
Error from server (NotFound): namespaces "apps" not found
[INFO] Secrets applied successfully
Read that last line again. Every single apply failed, and the script reported success. The find ... -exec kubectl apply -f {} \; command doesn’t propagate individual failures to the script’s exit code, and there’s no error checking around it. The script marched forward, oblivious, cheerfully announcing that zero secrets were “applied successfully.”
I love writing recovery scripts that lie to me during a recovery.
But there’s a deeper problem. Even if the ordering were fixed and namespaces existed, the sealed secrets themselves would still be useless.
How Sealed Secrets Work (and Why a New Cluster Breaks Them)
Sealed Secrets solves the “secrets in git” problem. You can’t store Kubernetes Secret manifests in git because they’re only base64-encoded, not encrypted. Sealed Secrets encrypts them with a public key so they can live safely in a repository.
The lifecycle:
1. Sealed Secrets controller generates an RSA key pair at install time
2. You encrypt secrets using the PUBLIC key (kubeseal --cert cert.pem)
3. The encrypted SealedSecret YAML goes into git
4. The controller in the cluster decrypts them using its PRIVATE key
5. Regular Kubernetes Secret objects are created from the decrypted data
The critical detail: the key pair is unique to each controller installation. When you destroy a cluster and create a new one, the new Sealed Secrets controller generates a new key pair. The old encrypted secrets — encrypted with the old public key — can’t be decrypted by the new private key. They’re cryptographic garbage.
Old cluster: PublicKey_A encrypts → SealedSecret → PrivateKey_A decrypts ✓
New cluster: PublicKey_A encrypts → SealedSecret → PrivateKey_B decrypts ✗ FAIL
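The key mismatch is ordinary RSA behavior, easy to demonstrate outside Kubernetes. A sketch with openssl standing in for the two controller generations (file names arbitrary, scratch directory for cleanliness):

```shell
cd "$(mktemp -d)"   # work in a scratch dir

# Two "controller installations" = two independent RSA key pairs
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out old.key 2>/dev/null
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out new.key 2>/dev/null
openssl pkey -in old.key -pubout -out old.pub

# "Seal" with the old cluster's public key
echo "db-password" | openssl pkeyutl -encrypt -pubin -inkey old.pub -out sealed.bin

# The old private key decrypts; the new cluster's key cannot
openssl pkeyutl -decrypt -inkey old.key -in sealed.bin          # → db-password
openssl pkeyutl -decrypt -inkey new.key -in sealed.bin 2>/dev/null \
  || echo "new cluster cannot unseal old secrets"
```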
After a cluster recreation, you must:
- Fetch the new certificate: `kubeseal --fetch-cert > cert.pem`
- Re-encrypt every secret with the new certificate
- Apply the new sealed secrets to the new cluster
The hard-reset.sh script handles step 1 (fetch_new_cert) and expects a manual step for step 2 (the wait_for_secrets function that prompts you to run a GitLab CI pipeline). But here’s where it gets fun…
The Circular Dependency Map
The re-sealing process depends on a GitLab CI pipeline. But GitLab runs inside the cluster.
To recover the cluster, you need:
→ Sealed secrets applied (so pods can access credentials)
→ Secrets re-sealed with new cert (old ones are garbage)
→ GitLab CI pipeline to re-seal secrets
→ GitLab running inside the cluster
→ The cluster being recovered ← CIRCULAR
This isn’t the only circular dependency. Here’s the full map:
| Dependency | What needs it | What provides it | Circular? |
|---|---|---|---|
| GitLab CI for re-sealing | Cluster recovery | GitLab pod in cluster | Yes |
| CF tunnel for SSH | Remote debugging | Cloudflared pod in cluster | Yes |
| ArgoCD for app deployment | App manifests | ArgoCD pod in cluster | Yes (partially — bootstrap is manual) |
| DNS for git pull | Script fetches from git | Cluster CoreDNS (only for in-cluster) | No (host DNS is independent) |
Three of these four are circular. The cluster needs things that run inside the cluster. If your recovery path depends on the thing you’re recovering, you don’t have a recovery path — you have a wish. Every critical dependency needs an out-of-cluster fallback.
Script Ordering Bugs and Silent Failures
The execution order in hard-reset.sh:
main() {
preflight_checks
create_directories
delete_cluster # Step 1: Destroy everything
create_cluster # Step 2: Fresh k3d cluster
apply_storage # Step 3: PVs and StorageClass
apply_secrets # Step 4: Sealed secrets ← FAILS (no namespaces yet!)
fetch_new_cert # Step 5: Get new cert
push_new_cert # Step 6: Push cert to git
wait_for_secrets # Step 7: Manual re-seal step
apply_apps # Step 8: ArgoCD + namespaces ← namespaces created HERE
wait_for_pods
setup_firewall
}
apply_secrets runs at step 4. Namespaces are created inside apply_apps at step 8, because apply_apps applies kubernetes/cluster/namespaces.yaml as part of the cluster resources. The secrets target namespaces (apps, networking, argocd) that don’t exist yet.
And apply_secrets() in cluster-lib.sh uses:
find "${REPO_ROOT}/kubernetes/secrets" -name "*-sealed-secret.yaml" -exec kubectl apply -f {} \;
The find -exec pattern doesn’t fail the overall command when individual kubectl apply calls fail. Each failed apply prints an error to stderr, but the exit code of find itself is 0 as long as the traversal succeeded. The script continues, prints “Secrets applied successfully,” and nobody notices that zero secrets were actually applied.
This is the silent failure pattern: error messages go to stderr, nothing checks for them, nothing counts them, and the script’s happy-path logging actively lies about the outcome. A script that crashes on error is annoying but honest. A script that swallows errors and prints “Success” is actively dangerous.
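The find -exec exit-code behavior is easy to verify in isolation (temp directory and file names are arbitrary):

```shell
# Every -exec command fails, yet find itself reports success
demo=$(mktemp -d)
touch "$demo/a-sealed-secret.yaml" "$demo/b-sealed-secret.yaml"
find "$demo" -name "*-sealed-secret.yaml" -exec false \;
echo "find exit code: $?"   # → find exit code: 0
```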
The DR Litmus Test
Here’s a test worth applying to every disaster recovery procedure:
Can you recover from zero without the thing you’re recovering?
Walk through your recovery script mentally and, for each step, ask: “Does this step require something that only exists inside the cluster?” If yes, that step will fail during a real DR scenario.
For hard-reset.sh:
| Step | Requires in-cluster component? | Fails during DR? |
|---|---|---|
| `delete_cluster` | No (k3d CLI) | No |
| `create_cluster` | No (k3d CLI) | No |
| `apply_storage` | No (kubectl + local files) | No |
| `apply_secrets` | Namespaces (not yet created) | Yes |
| `fetch_new_cert` | Sealed Secrets controller | No (installed by apply_secrets) |
| `wait_for_secrets` | GitLab CI pipeline | Yes |
| `apply_apps` | No (helm + kubectl + local files) | No |
Two steps fail. One’s an ordering bug (fixable). The other is a fundamental circular dependency (requires architectural change).
Breaking Circular Dependencies
Four strategies for breaking circular dependencies in infrastructure:
1. External bootstrap store. Keep secret values outside the cluster in a location that’s always accessible. This cluster uses gitlab.com CI/CD variables (external SaaS, not the self-hosted GitLab). During DR, glab variable get can retrieve every secret value without needing the cluster.
2. Self-contained scripts. The recovery script should embed or locally cache everything it needs. Instead of depending on a CI pipeline to re-seal secrets, hard-reset.sh should include a seal-secrets.sh that reads values from environment variables or a local file, seals them with the new cert, and applies them — all without network dependencies beyond the local cluster.
3. Fail loud, not silent. If a step fails, the script must stop and tell you. The || true pattern and unchecked find -exec swallow errors. Replace them with explicit error tracking:
# Bad: silent failure
find "${REPO_ROOT}/kubernetes/secrets" -name "*-sealed-secret.yaml" -exec kubectl apply -f {} \;
log_info "Secrets applied successfully"
# Good: fail loud
failures=0   # plain assignment — 'local' is only valid inside a function
while IFS= read -r -d '' secret_file; do
  if ! kubectl apply -f "$secret_file"; then
    log_error "Failed to apply: $secret_file"
    failures=$((failures + 1))  # note: ((failures++)) would trip set -e on the first increment
  fi
done < <(find "${REPO_ROOT}/kubernetes/secrets" -name "*-sealed-secret.yaml" -print0)
if [[ $failures -gt 0 ]]; then
  log_error "$failures secret(s) failed to apply"
  exit 1
fi
4. Test the DR path. Run hard-reset.sh (or dr-test.sh) regularly in a non-production context. The cluster already has a dr-test.sh script that creates a separate k3d cluster for testing — use it. A DR procedure that’s never been tested isn’t a DR procedure; it’s a hope.
Circular dependencies are invisible during normal operations. They only surface during recovery — the exact moment you can least afford surprises. Map them, break them, and test the breaks.
Part 5: Action Items and What I Learned
Lessons Learned
This incident exposed five patterns through direct, painful experience. Each one will repeat across different systems and contexts — which is exactly why I’m writing them down.
1. Your access path is part of your blast radius.
The Cloudflare tunnel runs as a pod inside the cluster. Any operation that disrupts the cluster also disrupts my ability to observe, debug, and recover it. This isn’t unique to tunnels — it applies to monitoring (if Prometheus is down, you can’t see that things are down), logging (if Loki is down, you can’t see why), and CI/CD (if ArgoCD is down, you can’t deploy fixes). Before touching infrastructure, trace your access path and verify it doesn’t pass through the thing you’re touching. I didn’t, and that’s how I ended up walking to my server room.
2. Admission policies are time bombs when applied to running workloads.
PodSecurity Standards only validate at pod creation time. A pod created before the policy was applied will run indefinitely without being checked. The violation only becomes visible when the pod is recreated — during a rollout, a node drain, or a cluster rebuild. Months of green dashboards hiding manifests that will fail on the next restart. --dry-run=server or CI linting is the only defense.
3. The fields you can’t see in default output are the ones that bite you.
PV labels, PVC selectors, finalizers, annotations, owner references — none of these appear in kubectl get output by default. When debugging a binding or scheduling failure where “everything looks right,” the problem is almost always in a field you’re not looking at. Train yourself to reach for -o yaml, --show-labels, and kubectl describe before concluding that something is broken at a deeper level.
4. Silent failures in scripts are worse than crashes.
hard-reset.sh applied zero secrets, logged “Secrets applied successfully,” and continued to the next step. During an outage, when you’re stressed and moving fast, you will trust the script’s output. If that output lies, you’ll waste time debugging the wrong thing. set -euo pipefail is the starting point, not the finish line.
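That last point is easy to demonstrate: even with the strict flags on, an `|| true` still launders a failure into success (a minimal sketch):

```shell
# strict mode doesn't see through '|| true'
bash -c 'set -euo pipefail
  false || true            # the failure is swallowed...
  echo "reached: exit=$?"  # ...and the script carries on
'
# → reached: exit=0
```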
5. Circular dependencies are invisible until recovery time.
During normal operations, GitLab CI runs inside the cluster and re-seals secrets on demand. The tunnel provides SSH access. ArgoCD syncs manifests. Everything works because everything’s already running. The circular dependencies only surface when you need to bootstrap from zero. Test your recovery procedure from zero, not from a half-working state.
Action Items
P0 — Do Immediately
| # | Action | Why |
|---|---|---|
| 1 | Add --registry-config to k3d cluster creation in cluster-lib.sh | Eliminates the need to ever restart a running node for registry config. The trigger for this entire incident disappears. |
| 2 | Fix hard-reset.sh ordering: create namespaces before applying secrets | Move kubectl apply -f kubernetes/cluster/namespaces.yaml before apply_secrets(), or have apply_secrets() create target namespaces if they don’t exist. |
| 3 | Fix hard-reset.sh error handling: fail loudly on secret apply errors | Replace find -exec kubectl apply with a loop that counts and reports failures. Remove || true from any kubectl apply call that matters. |
| 4 | Add PodSecurity validation to CI or pre-commit | Run kubectl apply --dry-run=server or use a policy linter (e.g., kyverno CLI, kubeconform with policy plugins) to validate all manifests against their target namespace’s PodSecurity level before merge. |
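Item 3's loop might look like the following. A sketch, assuming bash; the function name mirrors the script's apply_secrets(), but the body and messages are mine. It fails loudly on both error modes from the incident: individual apply failures and the "applied zero secrets, reported success" case.

```shell
# Apply each sealed secret individually; count and report failures
# instead of logging success unconditionally.
set -euo pipefail

apply_secrets() {
  local dir="$1" applied=0 failed=0
  while IFS= read -r -d '' f; do
    if kubectl apply -f "$f" >/dev/null; then
      applied=$((applied + 1))
    else
      echo "ERROR: failed to apply $f" >&2
      failed=$((failed + 1))
    fi
  done < <(find "$dir" -name '*.yaml' -print0)
  echo "Applied $applied secret(s), $failed failure(s)"
  [ "$failed" -eq 0 ] || return 1
  # Zero secrets applied is also a failure, not a success.
  [ "$applied" -gt 0 ] || { echo "ERROR: no secrets found in $dir" >&2; return 1; }
}
```

With set -euo pipefail at the top of hard-reset.sh, a nonzero return here halts the whole recovery run instead of letting it continue on a lie.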
P1 — Do This Week
| # | Action | Why |
|---|---|---|
| 5 | Make seal-secrets.sh self-contained for DR | Add an inline sealing step to hard-reset.sh that reads secret values from environment variables or a local file, seals them with the new cert, and applies them. Remove the dependency on GitLab CI for re-sealing during recovery. |
| 6 | Document “never restart the k3d node” in knowledge/k3d-cluster.md | Explicitly state that docker restart k3d-homelab-server-0 is forbidden. Document the safe alternatives: recreate with hard-reset.sh or pass config at creation time. |
| 7 | Add PV label verification to apply_storage() in cluster-lib.sh | After applying storage.yaml, verify that all PVs have the expected environment: prod label. Alert if any are missing. |
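Item 5's inline sealing step might be sketched as below, assuming bash and the kubeseal CLI. The secret name, namespace, key, and environment variable are placeholders of mine, not the repo's actual values.

```shell
# Seal one secret inline during recovery, using the freshly generated
# controller cert instead of round-tripping through GitLab CI.
set -euo pipefail

reseal_secret() {
  local name="$1" ns="$2" key="$3" value="$4" cert="$5"
  kubectl create secret generic "$name" \
    --namespace "$ns" \
    --from-literal="$key=$value" \
    --dry-run=client -o yaml \
    | kubeseal --cert "$cert" --format yaml \
    | kubectl apply -f -
}

# Usage during recovery (fetch the new controller's cert first):
#   kubeseal --fetch-cert > /tmp/sealed-secrets-cert.pem
#   reseal_secret cloudflared-token cloudflared token "$CF_TUNNEL_TOKEN" \
#     /tmp/sealed-secrets-cert.pem
```

This only works if the target namespace already exists, which is exactly why item 2's ordering fix has to land first.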
P2 — Do When Convenient
| # | Action | Why |
|---|---|---|
| 8 | Set up a secondary access path (Tailscale or host-level SSH) | A lightweight VPN or SSH daemon running on the host (not in k3d) provides backup access when the tunnel is down. Eliminates the single point of failure for remote access. |
| 9 | Run dr-test.sh monthly | The DR test script creates a separate k3d cluster and validates the full recovery path. Running it regularly catches ordering bugs, circular dependencies, and manifest drift before they matter. |
| 10 | Add a pre-flight blast radius check to dangerous scripts | Before any destructive operation, print what will be affected and require explicit confirmation. “This will restart k3d-homelab-server-0. This will terminate ALL pods including the CF tunnel. Remote access will be lost. Continue? (yes/no)” |
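Item 10's gate could be as small as this. A hedged sketch, assuming bash; the warning text mirrors the action item, and the function name is mine.

```shell
# Print the blast radius and require a literal "yes" before proceeding.
confirm_blast_radius() {
  cat >&2 <<'EOF'
This will restart k3d-homelab-server-0.
This will terminate ALL pods including the CF tunnel.
Remote access will be lost.
EOF
  read -r -p "Continue? (yes/no) " answer
  [ "$answer" = "yes" ]
}

# Usage at the top of a dangerous script:
#   confirm_blast_radius || { echo "Aborted." >&2; exit 1; }
```

Requiring the full word "yes" (not just y, and not Enter) is deliberate: it forces a moment of reading before the irreversible step.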
Before Touching Infrastructure: The Checklist
┌─────────────────────────────────────────────────────────────┐
│ BEFORE TOUCHING INFRASTRUCTURE │
│ │
│ □ BLAST RADIUS │
│ What is the worst case if this goes wrong? │
│ Does my access path go through the thing I'm changing? │
│ │
│ □ ROLLBACK PLAN │
│ Can I undo this? How? How long will it take? │
│ What state will the system be in if I need to roll back? │
│ │
│ □ ALTERNATE ACCESS │
│ If this breaks my primary access, how do I get in? │
│ Is physical access available if needed? │
│ │
│ □ NON-DESTRUCTIVE TEST │
│ Can I validate this with --dry-run=server? │
│ Can I inspect the current state without changing it? │
│ Can I test this on a non-production cluster first? │
│ │
│ □ MINIMAL CHANGE │
│ Is there a less disruptive way to achieve this? │
│ Am I changing one thing, or am I changing many? │
│ Can I scope this to one namespace/pod/node? │
│ │
│ If any box is unchecked, STOP. Fill it in first. │
└─────────────────────────────────────────────────────────────┘
The specific technologies will change — k3d, Cloudflare, Sealed Secrets — but the patterns won’t: blast radius analysis, reading error messages precisely, tracing dependency chains, and breaking circular dependencies. These are the skills that turn a 75-minute outage into a 15-minute one next time. Assuming I actually use the checklist.