‘Knowing’ is not the same as ‘understanding’

Grrr… let this be a lesson: don’t assume that the error message returned knows what the problem is. The message ERR_INTERNET_DISCONNECTED may well not mean what it says.

Connection reset? Generally, that indicates a TCP issue at some link in the chain — and there are a huge number of links in that chain. But, let’s have a quick look at things: Internet is fine. Connectivity is fine. Things are working fine.

Maybe the browser can’t resolve that site.

How would we check?

Just ask DNS. DNS resolution in my network has every device forwarding to a single DNS node, which forwards to a primary upstream resolver, falling back to a secondary only if the primary isn’t available. It will not switch between them unless a resolver is unreachable.
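That kind of ordered-failover forwarding can be sketched with dnsmasq, for example. The directives below are real dnsmasq options, but the upstream addresses are placeholder values (TEST-NET addresses), not the actual resolvers from this story:

```
# /etc/dnsmasq.conf on the central DNS node (sketch)
# strict-order makes dnsmasq try upstreams in the order listed,
# rather than load-balancing between them.
server=192.0.2.53    # primary upstream (placeholder address)
server=192.0.2.54    # secondary upstream (placeholder address)
strict-order
```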

I began with the assumption that my resolver was fine, and tried a few generally unused hostnames to see whether they would resolve.

john:~ john$ dig @ one-confluence.pearson.com | grep -v "^;"

The grep throws out anything that begins with a semicolon (dig’s comment lines), because at the moment we don’t care about those. We only want to see the results. Hmm… nothing. Okay, let’s try the alternate resolver:

john:~ john$ dig @ one-confluence.pearson.com | grep -v "^;"
one-confluence.pearson.com. 126 IN CNAME one-confluence.glb.pearson.com.
one-confluence.glb.pearson.com. 29 IN A
Ah! Well, that's different.

A quick tweak to the router, and a refresh of the IP info on the workstation and…

Summary: yes, error messages mean things, but sometimes the developer who wrote them jumped to their own conclusions.

Also, this tends to underscore that the two resolvers are not identical, nor are they expected to be.
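Since the two resolvers aren’t expected to be identical, a periodic sanity check can catch drift before a user does. A minimal sketch, assuming hypothetical resolver addresses (the 192.0.2.x values are placeholders for your own primary and secondary DNS nodes):

```shell
#!/usr/bin/env bash
# Sketch: ask both resolvers the same question and compare the answers.
PRIMARY="192.0.2.53"     # placeholder: primary resolver
SECONDARY="192.0.2.54"   # placeholder: secondary resolver
NAME="one-confluence.pearson.com"

answers_from() {
  # +short prints just the answer records, one per line;
  # sort so record ordering differences don't count as drift
  dig @"$1" "$NAME" +short | sort
}

compare_resolvers() {
  local a b
  a=$(answers_from "$PRIMARY")
  b=$(answers_from "$SECONDARY")
  if [ "$a" = "$b" ]; then
    echo "resolvers agree"
  else
    echo "resolvers disagree"
    printf 'primary:\n%s\nsecondary:\n%s\n' "$a" "$b"
  fi
}

# Uncomment to run against live resolvers:
# compare_resolvers
```

Sorting the answers first keeps round-robin record ordering from producing false alarms; a real disagreement, like the empty answer above, still shows up.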

A Reboot is Not A Fix

Somebody said, “I need to reboot the DB servers…” and I immediately felt my eye starting to twitch.

I quickly asked, “Why are you rebooting it?”

“Because it’s unstable,” was his reply.

“But what have you done to it to identify the cause?” I challenged. “What changes have you made to correct it?”

“Well, none! I’m rebooting it because it’s unstable!”

I’m not concerned about rebooting a DB server in itself; sure, there will be a brief period of unavailability. What concerned me was that a reboot wouldn’t address the actual problem.

You see, a reboot isn’t a fix. It’s only a postponement of the inevitable. You’ll still need to address the actual problem that led to the instability. Instead, you’re operating on the flawed belief that a reboot will fix something, or that it will buy you time, this time, so the real problem can be fixed.

And, predictably, doing a reboot only masked the issue.


It was rebooted again about two hours later. No fixes, and no further troubleshooting, that time either. What’s still needed is a clear understanding of why it became unstable in the first place: addressing the root cause of the instability itself.

It’ll be rebooted again. And again.

This Is An Alert…

Please, god: you do not need to send alerts for absolutely every automatic process. And no, “just in case” is not a reason; it only makes the alerts more meaningless.

Alerts are an interruption for someone. An alert is a mechanism for telling someone that something has occurred that the automation doesn’t know how to handle: this has happened, here’s why, and here’s what you need to do.

I can’t count the number of times that I’ve seen alerts going out to interrupt everyone to let them know that automation has done the automatic things that it was normally expected to do.

[Screenshot from 5-1-19 at 10:12]
In 24h, this channel was notifying the SRE team of — automation doing its job. 1,781 times.

It’s exactly like sending a notification for every heartbeat. I only want to know if it’s not doing what it’s implemented to handle automatically. Then I’ll worry about it.
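That principle can be sketched as a wrapper that runs the automated job quietly and interrupts a human only when the automation couldn’t handle it. `notify_oncall` here is a hypothetical stand-in for whatever paging or chat hook you actually use:

```shell
#!/usr/bin/env bash
# Sketch: alert only when automation fails, never when it succeeds.

notify_oncall() {
  # Hypothetical paging hook; replace with your real integration.
  echo "ALERT: $*" >&2
}

run_quietly() {
  local desc="$1"; shift
  # Success is the expected case; nobody needs to hear about it.
  "$@" && return 0
  notify_oncall "$desc failed (exit $?): $*"
}

# Example: a nightly cleanup that should only page when it breaks.
# run_quietly "nightly cleanup" /usr/local/bin/cleanup.sh
```

The success path is silent by design: if a run succeeds, the only trace should be in logs someone can consult later, not in an interruption someone has to dismiss now.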