Solve Every Problem Twice

December 3, 2019

One habit that I think every software developer, and practically every professional in any field, can benefit from is solving every problem twice.

I remember first reading about a similar concept in Joel Spolsky's blog, [Joel on Software](https://www.joelonsoftware.com/), where he wrote [back in 2007](https://www.joelonsoftware.com/2007/02/19/seven-steps-to-remarkable-customer-service/):
> Fix everything two ways
>
> Almost every tech support problem has two solutions. The superficial and immediate solution is just to solve the customer’s problem. But when you think a little harder you can usually find a deeper solution: a way to prevent this particular problem from ever happening again.

Obviously, I believe this principle applies to more than just customer service.

A related concept comes out of Toyota: the Five Whys. Quoting from Wikipedia:

> Five whys is an iterative interrogative technique used to explore the cause-and-effect relationships underlying a particular problem. The primary goal of the technique is to determine the root cause of a defect or problem by repeating the question "Why?". Each answer forms the basis of the next question.

When tackling an observed problem, whether it be in code, business processes, or potentially even a leaky sink, I like to combine these principles into a technique I call “Solve Every Problem Twice”.

Solve Every Problem Twice

As with the Five Whys, don’t take “twice” too literally. In practice, this technique should always yield at least two solutions, and will often yield five or more.

The steps I follow are:

1. The Five Whys

Use the Five Whys to determine the multiple causes of the observed problem.

2. Solve each problem at least once

Apply Joel Spolsky’s advice of solving each cause at least once. Each cause should have an immediate fix, and most will also have at least one deeper solution.

3. Repeat for the problem-solving process

Go through the first two steps again, this time for the process of solving the original observed problem.

A Real-World Example

To illustrate the technique in practice, let me describe a problem I ran into recently.

I wanted to do an update to one of the web sites I own, MinimalPairs.net, when I ran into a problem. I host the code for this web site on GitLab, where I use GitLab-CI for my continuous integration and deployment. I have GitLab-CI configured to create a review environment for me whenever a merge request is created.

When I recently pushed a change, I discovered the review environment was not working: Chrome showed its familiar “Your connection is not private” warning, which appears when an SSL certificate is broken.

I use Let’s Encrypt, which I’ve written about before, to manage my SSL certificates for me. Sometimes it can take a few minutes to get a new certificate, so I was patient. But half an hour later it was still not working, so I knew I had a legitimate problem.

With a little digging through my Kubernetes logs, I found the cause of the problem:

    Status:
      Acme:
        Uri:
      Conditions:
        Last Transaction Time:  2019-11-18T08:41:49Z
        Message:                Failed to verify ACME account: acme: urn:ietf:params:acme:error:rateLimited: Your ACME client is too old. Please upgrade to a newer version.
        Reason:                 ErrRegisterACMEAccount
        Status:                 False
        Type:                   Ready

I then looked in the configuration for my Kubernetes cluster and found that I was requiring version 0.5.2 of the cert-manager package:

    helm install stable/cert-manager \
        --name cert-manager \
        --version 0.5.2 \
        --set ingressShim.defaultIssuerName=letsencrypt-prod \
        --set ingressShim.defaultIssuerKind=ClusterIssuer \
        --namespace kube-system \
        --tls

At the time, the latest version was 0.11.0, so clearly an upgrade was in order.
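The gap between 0.5.2 and 0.11.0 is easy to misjudge, because dotted versions do not sort correctly as plain strings. The following is a minimal sketch of a numeric comparison, assuming simple `major.minor.patch` version strings. The function names `parse_version` and `is_outdated` are hypothetical, not part of Helm or cert-manager.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '0.11.0' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def is_outdated(pinned: str, latest: str) -> bool:
    """Return True when the pinned version is older than the latest one."""
    return parse_version(pinned) < parse_version(latest)


pinned, latest = "0.5.2", "0.11.0"
print(is_outdated(pinned, latest))  # compares (0, 5, 2) with (0, 11, 0)
print(pinned < latest)              # string comparison: '5' sorts after '1'
```

The numeric comparison reports the pin as outdated, while the string comparison does not, which is why a check like this should compare integer tuples.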

With the immediate and root causes determined, let’s go through the steps outlined above.

The Five Whys

The observed problem was that the SSL certificate was broken, which leads to our first why:

1. Why is the SSL certificate broken?

The reason, as discovered above, is that I was using an old, unsupported version of the cert-manager package. This leads to the second why:

2. Why do I need to upgrade the certificate manager?

As you may recall from above, I was explicitly requesting version 0.5.2 of the cert-manager package. Perhaps it would be reasonable to always install the latest version.

Solve Each Problem At Least Once

Now I can go through the two problems I identified above, and resolve to solve each at least once.

1. Install the latest version of the certificate manager

This will solve the immediate, superficial problem, and get my web site working again.

2. No longer require a specific version, and always install the latest

This will prevent the problem from reoccurring in the future. Of course, this may open up my system to a new risk, in case a new version of cert-manager somehow breaks something, but it may be a risk worth taking.

Repeat for the Problem-Solving Process

But don't forget the final step! Repeat for the problem-solving process itself.

In my example, I found two areas where I believe I could have improved the process of fixing the problem.

1. I should have noticed the problem sooner

I don’t update MinimalPairs.net very often. For all I know, this problem may have been lying in wait for weeks before I attempted an update and noticed.

Two possible solutions come to mind for this problem. The first is to use a simple monitoring service to alert me when the web site’s SSL certificate is no longer working.
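A check of this first kind can be very small. The sketch below is one possible shape for it, using only the Python standard library rather than any particular monitoring service. The function names are hypothetical, and scheduling the check and alerting on the result are left out.

```python
import socket
import ssl
import time


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter date."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / 86400


def check_certificate(hostname: str, port: int = 443, timeout: float = 10.0) -> float:
    """Connect over TLS and return the days left on the server certificate.

    The default context verifies the chain and the hostname, so a broken or
    expired certificate raises ssl.SSLCertVerificationError instead of
    returning a number.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    return days_until_expiry(cert["notAfter"], time.time())
```

Calling `check_certificate` with the site's hostname from a scheduled job, and alerting when it raises or returns a small number of days, would have surfaced the broken certificate without waiting for a manual deploy.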

Second, and more proactively, I could use the same error logs which I used to debug the problem, and have them sent to a service such as Sentry.io, which can notify me immediately whenever a problem occurs.

In the spirit of solving each problem twice, I should do both of these.
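For the log-based alert, the status output shown earlier already contains everything needed to detect the failure. The sketch below parses that output and returns the error message when the Ready condition is False. It assumes the output describes a single condition, as in the example above. The function names are hypothetical, and forwarding the message to an alerting service is not shown.

```python
from typing import Dict, Optional

# Status output copied from the failing issuer described above.
SAMPLE = """\
Status:
  Acme:
    Uri:
  Conditions:
    Last Transaction Time:  2019-11-18T08:41:49Z
    Message:                Failed to verify ACME account: acme: urn:ietf:params:acme:error:rateLimited: Your ACME client is too old. Please upgrade to a newer version.
    Reason:                 ErrRegisterACMEAccount
    Status:                 False
    Type:                   Ready
"""


def parse_condition(describe_output: str) -> Dict[str, str]:
    """Collect the 'Key: value' pairs, skipping keys that have no value."""
    fields: Dict[str, str] = {}
    for line in describe_output.splitlines():
        # Split on the first colon only, so colons inside the value survive.
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def failure_message(describe_output: str) -> Optional[str]:
    """Return the error message if the Ready condition is False, else None."""
    fields = parse_condition(describe_output)
    if fields.get("Type") == "Ready" and fields.get("Status") == "False":
        return fields.get("Message", "unknown failure")
    return None
```

Running `failure_message(SAMPLE)` returns the “Your ACME client is too old” message, which is the text an alert would need to carry.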

2. It should have been easier to find the failure logs

Once the problem was identified, debugging it took longer than should have been necessary. This was largely because Kubernetes doesn’t keep all logs in a centralized location. This could be solved by setting up a centralized logging system. I already use Loggly for most of my logging, so I can just set it up to track my Kubernetes logs as well.

Conclusion

Using my technique, I came up with five potential solutions to a simple SSL certificate problem:

  1. Upgrade the certificate manager
  2. Don’t depend on a specific version of the certificate manager
  3. Set up monitoring for the web site
  4. Set up error alerting
  5. Set up better logging

By applying all five of these solutions, I can ensure that not only have I solved the immediate problem, but that the overall health of my entire system is improving, and the next problem, no matter where it happens in the technology stack, will be that much easier to solve.

