Debugging by stripping down

From ~esantoro

I want to take note of a "debugging" (troubleshooting, really) session, and of something I did unconsciously that I only thought about later and realized was a great step (gg me).

(Sorry for the technical rant).

The troubleshooting session

I was pulled into a debugging session for a problem that had been affecting some of my colleagues for about a week. I had identified that the issue was not with the infrastructure, so I pulled myself out of the discussion pretty quickly.

One week passed, and I was pulled back into the discussion to root cause the issue.

The troubleshooting call took two full hours. I finally gave up and started swearing, and that led to finding the root cause of the issue (yep, swearing strongly correlates with finding the root cause of issues -- that's science!).

To give some context: a Kubernetes deployment was failing after a software update. Each pod of that deployment ran three containers, one of which would fail on boot.
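(For readers who want to follow along, something like this is usually enough to spot which container is the unhappy one; all names below are made up.)

```
# List the pods and check the READY column (e.g. 2/3 means one container is down),
# then look at per-container state and the logs of the previous, crashed run.
kubectl get pods -n some-namespace
kubectl describe pod some-pod -n some-namespace
kubectl logs some-pod -n some-namespace -c failing-container --previous
```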

Now, I had pulled myself out of this issue because the developers had decided to use s6 as a process manager within the container. If you decide to run a process manager within a container within a pod within Kubernetes, then anything that happens after the container starts is not my problem. You picked that, you troubleshoot that.

The issue was that s6 was complaining about wanting to be pid 1, but somebody else was already pid 1.

I replaced the execution of s6 with a dumb sleep 1d and watched the pod come to life: I could debug further.
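That's just a command override on the deployment; a sketch of the idea, with hypothetical names and container index:

```
# JSON-patch the deployment so the suspect container runs "sleep 1d" instead of s6.
# "add" also replaces the field if it is already set in the spec.
kubectl patch deployment my-deployment -n some-namespace --type=json -p '[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/command",
   "value": ["sleep", "1d"]}
]'
```

With the container just sleeping, you can kubectl exec into it and poke around at leisure.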

The thing is, who was holding pid 1? It was a pause process. Now, I've seen that in the past (podman anyone?) so I didn't give it much thought.
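(If you want to check for yourself who is squatting on pid 1, you can peek into the container's pid namespace; names are hypothetical, and since the image may not even ship ps, reading /proc directly is the safer bet.)

```
# What is actually running as pid 1 inside this container?
# /proc/1/cmdline is NUL-separated, but readable enough.
kubectl exec some-pod -n some-namespace -c my-container -- cat /proc/1/cmdline
# Or, if the image ships a full ps:
kubectl exec some-pod -n some-namespace -c my-container -- ps -e -o pid,comm
```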

Who was launching that process? It wasn't the command parameter of any of the containers. Heck, there was no pause binary in any of the container images at all.
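That part is easy to verify from the pod spec itself (hypothetical names again):

```
# Print each container's name with the command/args it is configured to run.
kubectl get pod some-pod -n some-namespace \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.command}{" "}{.args}{"\n"}{end}'
```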

At this point I removed one of the three containers from the spec. Down to two, problem still present; one more container to remove to isolate the issue.
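Dropping a container from the spec is just another patch away (index and names hypothetical):

```
# Remove the second container entry from the deployment's pod template.
kubectl patch deployment my-deployment -n some-namespace --type=json \
  -p '[{"op": "remove", "path": "/spec/template/spec/containers/1"}]'
```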

Basically: let's strip this to the minimal state and get this running.

Then maybe we'll re-add the previously removed pieces one by one and see what's preventing this from starting.

Except... There was no other container to remove. Only at this point did I realize the deployment only defined two containers. Actually, the container I had already removed from the spec was the "second container". So, according to the spec, the pods should have had only one container, yet there were two.
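The mismatch is easy to make visible once you think of looking for it (names hypothetical):

```
# Containers the deployment says it wants...
kubectl get deployment my-deployment -n some-namespace \
  -o jsonpath='{.spec.template.spec.containers[*].name}{"\n"}'
# ...versus containers a running pod actually has.
kubectl get pod some-pod -n some-namespace \
  -o jsonpath='{.spec.containers[*].name}{"\n"}'
```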

Only at this point things started to click.

Something else had to be adding that container. That quickly made me look for a mutating webhook and... there actually was a webhook doing exactly that (injecting additional containers) on pods deployed in namespaces carrying a certain label.
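Hunting it down is a matter of listing the admission webhooks and checking the namespace labels; something along these lines (names hypothetical):

```
# Which mutating webhooks exist, and which namespaces do they target?
kubectl get mutatingwebhookconfigurations
kubectl get mutatingwebhookconfiguration some-injector -o yaml   # check namespaceSelector
# Does our namespace carry the label the webhook selects on?
kubectl get namespace some-namespace --show-labels
```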

I removed the label and rollout-restarted that deployment, and voilà: no extra container, no pause binary running, and s6 could happily live as pid 1.
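Roughly what that looked like (label key and names are hypothetical):

```
# A trailing dash on the key removes the label from the namespace...
kubectl label namespace some-namespace injection-enabled-
# ...and a rollout restart recreates the pods without the injected container.
kubectl rollout restart deployment/my-deployment -n some-namespace
```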

Now, the pod still didn't work, as that extra container was doing actual work after all, but s6 could start.

The root cause was found, and the developers now know what's bothering them and can go implement my initial suggestion ("get rid of s6").

I could finally pull myself out of this issue again, this time with a bit of satisfaction (I kinda like troubleshooting stuff).

This piece isn't really about pods

Essentially this boils down to the cause of problems becoming evident (or at least much easier to spot) when you remove noise, when you strip the situation down to the bare minimum, when you iteratively remove more and more until the only thing remaining is what's supposed to be working. At that point the issue is very likely to become self-evident, almost trivial to spot (in hindsight, at least).

I wonder what else in life I can apply this approach to.

I wonder what else I should "strip down" to a minimum.