‘Knowing’ is not the same as ‘understanding’

Grrr… let this be a lesson: don’t assume that the error message you’re shown knows what the problem is. ERR_CONNECTION_RESET may well not mean what it says.

Connection reset? Generally, that indicates a TCP issue at some link in the chain — and there are a huge number of links in that chain. But, let’s have a quick look at things: Internet is fine. Connectivity is fine. Things are working fine.

Maybe the browser can’t resolve that site.

How would we check?

Just ask DNS. In my network, every device forwards to a single DNS node, which in turn forwards to 8.8.8.8, falling back to 8.8.4.4 only if the first one isn’t available. It will not switch between them unless a DNS server becomes unreachable.

I began with the assumption that my resolver was fine, and tried a few generally unused hosts to see if they would resolve.

john:~ john$ dig @8.8.8.8 one-confluence.pearson.com | grep -v "^;"

We also have grep throw out anything that begins with a semicolon because, at the moment, we don’t care about the metadata; we only want to see the results. Hmm… nothing. Okay, let’s try the alternate resolver:

john:~ john$ dig @8.8.4.4 one-confluence.pearson.com | grep -v "^;"
one-confluence.pearson.com. 126 IN CNAME one-confluence.glb.pearson.com.
one-confluence.glb.pearson.com. 29 IN A 159.182.4.64
Ah! Well, that's different.
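The two queries above can be repeated side by side in one go. A minimal sketch (the hostname is the one from this incident; `classify_answer` is my own helper, not part of dig, and empty output from `dig +short` is treated as “no answer”):

```shell
#!/bin/sh
# Classify a dig answer: empty output from `dig +short` means no answer.
classify_answer() {
  if [ -z "$1" ]; then
    echo "NO ANSWER"
  else
    echo "$1"
  fi
}

# Query the same name against both upstream resolvers and compare.
check_name() {
  for resolver in 8.8.8.8 8.8.4.4; do
    answer="$(dig @"$resolver" +short "$1" 2>/dev/null)"
    echo "$resolver: $(classify_answer "$answer")"
  done
}

check_name one-confluence.pearson.com
```

Had something like this been on hand, the resolver discrepancy would have shown up in one screenful instead of two manual queries.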

A quick tweak to the router, and a refresh of the IP info on the workstation and…

Summary: yes, the errors mean things, but sometimes the developer who wrote them jumped to their own conclusions.

Also, this tends to underscore that the two resolvers are not identical, nor are they expected to be.

A Reboot is Not A Fix

Somebody said, “I need to reboot the DB server…” and I immediately felt my eye starting to twitch.

I quickly asked, “Why are you rebooting it?”

“Because it’s unstable,” was his reply.

“But what have you done to it to identify the cause?” I challenged, “What changes have you made to correct it?”

“Well, none! I’m rebooting it because it’s unstable!”

I’m not concerned about rebooting a DB server as such; yes, there will be a brief period of unavailability. My concern was that a reboot wouldn’t address the actual problem.

You see, a reboot isn’t a fix. It’s only a postponement of the inevitable. You’ll still need to address the actual problem that led to the instability. Instead, there’s a flawed belief that a reboot will fix something, or that it will buy you time, this time, so it can be fixed.

And, predictably, doing a reboot only masked the issue.

Temporarily.

It was rebooted again about two hours later. It wasn’t fixed or troubleshot any further that time either, and it still needs someone to understand why it became unstable to begin with: to address the root cause of the instability itself.

It’ll be rebooted again. And again.
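If nothing else, capture some state before the next reboot erases the evidence. A minimal sketch, assuming a Linux host; the snapshot directory and the choice of commands are mine, not a prescribed procedure:

```shell
#!/bin/sh
# Snapshot basic system state so there's something to troubleshoot with
# after the inevitable reboot.
snapdir="/tmp/instability-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$snapdir"

uptime                                       > "$snapdir/uptime.txt"
ps aux --sort=-%cpu 2>/dev/null | head -n 20 > "$snapdir/top-cpu.txt"
dmesg 2>/dev/null | tail -n 200              > "$snapdir/dmesg-tail.txt"

echo "evidence saved to $snapdir"
```

Thirty seconds of this before each reboot would at least leave a trail pointing at the root cause.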

This Is An Alert…

Please god—you do not need to send alerts for absolutely every automatic process. And, no, “just in case” is even more meaningless.

Alerts are an interruption for someone. An alert is a mechanism to tell someone that something has occurred that the automation doesn’t know how to handle: this has happened, here’s why, and here’s what you need to do.

I can’t count the number of times that I’ve seen alerts going out to interrupt everyone to let them know that automation has done the automatic things that it was normally expected to do.

In 24h, this channel was notifying the SRE team of — automation doing its job. 1,781 times.

It’s exactly like sending…

*thump*

…a notification for every…

*thump*

…heartbeat. I only want to know…

*thump*

…if it’s not doing what it’s…

*thump*

…implemented to handle…

*beeeeeeeep*

…automatically. Then I’ll worry about it.
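The heartbeat metaphor reduces to a simple rule: success is silent, and only the exception path interrupts anyone. A sketch (`send_alert` is a hypothetical stand-in for a real paging integration):

```shell
#!/bin/sh
# Alert only when automation fails; routine success produces no output.
send_alert() {
  # Hypothetical stand-in: a real version would page or post to a channel.
  echo "ALERT: $*"
}

run_and_alert() {
  if "$@"; then
    : # success: say nothing; nobody needs an interruption
  else
    send_alert "job failed: $*"
  fi
}

run_and_alert true    # routine run: silent
run_and_alert false   # failure: emits "ALERT: job failed: false"
```

The 1,781 notifications above all came from the `true` branch of exactly this kind of wrapper.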

Take some RISC

CPUs are insanely inefficient.

Fast, yes. They run at billions of calculations per second. But they also carry around a bit of bloat.

Bloat is one of the biggest tiny issues affecting tech today.

CPUs (Intel’s and AMD’s come to mind) have instruction sets that are rather large. It takes energy to cart all of those instructions around, even when they aren’t used.

Without consideration of the concept of word-size*, we refer to them by an addressable bit-length: 4-bit, 8-bit… 64-bit. But let this sink in for a moment: a 4-bit instruction set is only 2^4 = 16 items; even a full 8-bit opcode byte gives just 256. That’s a list of only 16 (or 256) possible instructions from which to draw. From those foundational instructions, we’ve managed to design and accomplish a great deal.

* yes, I know that it’s still dependent on word-size, linear, and physical address space. This is meant to be a rant and is a gross generalization
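The arithmetic behind those counts is just powers of two; a quick sketch:

```shell
#!/bin/sh
# An n-bit opcode field can name 2^n distinct instructions.
for bits in 4 8 16; do
  echo "${bits}-bit: $((1 << bits)) possible instructions"
done
# prints:
#   4-bit: 16 possible instructions
#   8-bit: 256 possible instructions
#   16-bit: 65536 possible instructions
```

Every extra bit of encoding width doubles the space available for instructions, which is exactly how the set balloons.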

Consider the process of moving to a new memory location, reading a value, adding it to a value at a separate location, and putting the result into a new memory location: maybe 63 steps to do one particular task.

And we want to do more. Decades ago, we observed that the way we were writing instructions was, in fact, insanely inefficient. The same sequences were repeated over and over, so we decided to expand the instruction set. Rather than have compute cycles consumed by the common work of interpreting and reinterpreting our instructions, we could simply increase the number of instructions the chip could do and use a hard-coded function instead.

It’s identical capability, but it’s now a single instruction built right onto the chip. Instead of needing 372 of our steps to accomplish a particular task, it may actually need just three. It’s substantially more efficient.

And with the current swath of CPUs available, it’s ballooned to 64-bit architectures (in practice, 48 bits of address space). Do the math and that equates to an enormous set of possible instructions. Of course, the list isn’t full; it still has necessary blank spots that literally do nothing at all. And there are so many hard-coded instructions available that it’s unlikely anyone knows exactly what’s on the list.

But that one chip will still do anything one can currently conceive of.

Can’t find an instruction to do something? Write code that leverages the other instructions to do it. You’ll take a small performance hit, but it’s so small you won’t even notice it; the thing runs at 3 billion operations per second. How bad could it be?

There’s a price either way, before or after hard-coding the process. The chip also needs to spend time carting those instructions around, and just keeping them at the ready costs efficiency and power: about 65 watts per chip package.

Unless you want to throw out those entirely-unneeded architectural instruction sets. You’ll be looking at designing a new chip with a reduced instruction set. Hmm, what if we call it a Reduced Instruction Set Computer?

Intel’s & AMD’s x86 architecture (where x86/x86-64 is, in the current age, the family name and doesn’t refer directly to the instruction-set size) has onboard the entire possible set of addressable instructions.

But if you shift to a RISC-based system that includes in its instruction set most of what you want to accomplish, and isn’t taking up resources to keep unneeded capabilities alive, it’ll be more efficient.

So, where AMD’s & Intel’s desktop-class packages consume about 65 watts per CPU package (server-class is about twice the draw) to provide incredible capability, there are RISC packages that can outperform them while drawing only about 10 watts.

Take some RISC and move away from these monolithic architectures.

CPU Load Isn’t A Performance Indicator

Ever.

Let me clarify.

While CPU Load has value, its interpretation depends on a thorough understanding of what it’s actually indicating and why. It’s no longer meaningful as a performance indicator, because it merely suggests a physical CPU’s ability to keep up with some processing load, and that is easily, grossly misinterpreted and misunderstood.

Why is this a problem?

Let’s start with jobs that are run during an idle state. Or, more specifically, the niceness of a process.

There are processes that, by design — and a lot of them — are de-prioritized and will use resources only if they’re available.

Anyone recall the SETI@home project? Remember how it would happily run “in the background”? It was designed around the priority concept: it always had some quantity of work to chew on, but because it was a background job, if you wanted any resources at all, it would happily step out of the way and yield the CPU.
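You can see the same mechanism from a shell: `nice` with no operands prints the calling process’s niceness, and `nice -n` runs a command de-prioritized. A sketch:

```shell
#!/bin/sh
# Print our own niceness, then run a command at a lower priority.
nice              # typically 0 for an ordinary shell
nice -n 10 nice   # the child reports a higher niceness (lower priority)
```

A SETI@home-style background job is just a process running at a high niceness: the scheduler hands it the CPU only when nothing more important wants it.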

It wasn’t the first. There are loads of system-level tasks that do exactly the same thing. Rather than sit idle, doing nothing at all until you ask for something, the system keeps loads of background tasks busy.

The classic interpretation of “CPU Load”, processor load, and the like as a metric indicating system performance would’ve suggested that the system load was too high and that it needed additional resources.

But it wasn’t.

It was just running exactly as it was designed and doing the work that it was meant to and would happily put things into a wait state if you wanted to do anything else.

In fact, even when effectively idle, a CPU will run at precisely the clock rate that it was intended to. Long ago, Unix started presenting a CPU Load Average: an indicator of the average number of processes that were waiting for the CPU (in a wait state) during the last one, five, and fifteen minutes.

These, too, will return an extremely high figure should a lower priority process want to run.

Normal.
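Those three figures are trivially inspectable. A sketch, assuming a Linux host with procfs:

```shell
#!/bin/sh
# The 1-, 5-, and 15-minute load averages, straight from the kernel.
cat /proc/loadavg
# Fields: three load averages, runnable/total task counts, last PID.

# The same three numbers appear at the end of uptime's output.
uptime
```

A number higher than the core count here tells you processes are queueing for the CPU; it does not tell you whether anything a user cares about is actually slow.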

Also, “Priorities”. They work.

Nowadays, we have much the same challenge — working to change perspectives, helping people to unlearn what they’ve learned — but on containerized workloads.

“But teh CPU! MOARSERVERESZ!”

Wrong perspective. The right perspective is to ask yourself instead, “How’s my app’s performance?” Is it responsive to requests that it receives? Ask yourself, “Have I taken every conceivable step I can to improve its performance?” You have? All of them that you can? Are you sure?Have you also taken steps to fundamentally shrink its instruction set so it’s not carrying around all of that unneeded bloat? I’m not referring to code-bloat — code-bloat is a separate problem there, too — but, more foundational than code: have you reduced the CPU instruction set?