The Panacea-Tool Incident

A few years ago — 2014 maybe? — we were in the early days of distributed teams and were spread across three time zones. Timing was awkward, so many of us would start our day from home to join calls and meetings. This was, for us, the beginning of regular telecommuting. To help ease the communication challenges, we also embraced video conferencing, screen shares, and multimedia.

One morning, an expert was brought in to demonstrate and train the lot of us on the new Panacea the company had invested in: an app that would help manage all of our systems. It was a unified, do-everything tool that would provide visibility into specific known states and anomalies on any number of systems across our several geographic locations and datacenters. It would pin down the exact origin of a problem and eliminate the need to log into a server (via SSH, of course) ever again… in order to resolve the issue.

Anyway, during the demo, there was this one error that kept occurring, which prevented moving any further with the demo or the training.

It was something about a missing object, or log file, or permissions to it.

If only there was a tool that had the power and capacity to identify the problem and resolve it… we could use that. It would be a perfect opportunity!

Their sales engineer was stumped.

After he fought with it for half an hour or so, I suggested that we take a quick look at the actual logs on the system. Odds were pretty good that they’d indicate where the problem was. There was no harm in checking.

“No!” he asserted. “That’s the wrong way!” And we endured continuous rants of frustration and borderline vulgarities from him. “This guy!” he jokingly exclaimed. “What you want to do is impossible!”

Oh, I’m sorry… I thought you’d used the word “IMPOSSIBLE” just there.

Challenge accepted.

I quickly shared my screen and jumped over and skimmed the actual logs from the app on the server itself. Let’s see… at the end of the log file, it had logged that it had crashed. Why? Scroll up a few lines and… permission denied trying to write to one of its own files.

“Oh! I’ll just ‘chmod’ that file so its owner can write to it…”

He boisterously interrupted, “If that’s it, I’ll buy you a steak dinner!”

**tap,tap,tap** **Enter** “Okay, all set… let’s give it another try really quick…”

The problem went away. He was clearly offended that somebody had done it “the wrong way” and found and fixed the problem so quickly.

Took about 20 seconds.
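For the record, those 20 seconds looked roughly like the following. The paths and filenames here are hypothetical stand-ins (the real ones are long forgotten), but the shape of it is accurate:

# hypothetical paths; the tool and its files are stand-ins for the real thing
tail -n 50 /var/log/panacea/agent.log    # crash logged at the bottom; a few lines up: “permission denied” on its own state file
ls -l /var/opt/panacea/state.db          # owned by the service account, yet not writable by its owner
chmod u+w /var/opt/panacea/state.db      # give the owner write access to its own file again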

And the really amusing part is that all of this was a perfect scenario to demonstrate the power and capability of the app itself.

CPU Load Isn’t A Performance Indicator

Ever.

Let me clarify.

While CPU Load has value, its interpretation depends on a thorough understanding of what it’s actually indicating and why. It’s no longer meaningful as a performance indicator, because it suggests a physical CPU’s ability to keep up with some processing load, and that is easily, and grossly, misinterpreted and misunderstood.

Why is this a problem?

Let’s start with jobs that are run during an idle state. Or, more specifically, the niceness of a process.

There are processes that, by design — and a lot of them — are de-prioritized and will use resources only if they’re available.

Anyone recall the SETI@home project? Remember how it would happily run “in the background”? It was designed around exactly this priority concept: it always had some quantity of work to chew on, but because it ran as a background job, if you wanted any resources at all, it would happily step out of the way and yield the CPU.

It wasn’t the first, either. There are loads of system-level tasks that do exactly the same thing. Rather than sit idle, doing nothing at all until you ask for something, your system is quietly working through loads of background tasks just like it.
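If you’d like to see that behaviour for yourself, it’s one command away. A minimal sketch, using a pointless busy-loop as a stand-in for a real background workload:

# start a CPU-hungry job at the lowest scheduling priority (nice 19);
# it will soak up otherwise-idle cycles, but steps aside the moment anything else wants the CPU
nice -n 19 sh -c 'while :; do :; done' &

# an already-running process can be demoted the same way (12345 is a placeholder PID)
renice -n 19 -p 12345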

The classic interpretation of “CPU Load” (or processor load and the like) as a metric of system performance would’ve suggested that the load was too high and that the system needed additional resources.

But it wasn’t.

It was just running exactly as designed, doing the work it was meant to, and it would happily put that work into a wait state if you wanted to do anything else.

In fact, even when effectively idle, a CPU will run at precisely the clock rate it was intended to. Long ago, Unix started presenting a CPU Load Average — an indicator of the average number of processes that were waiting for the CPU (a wait state) during the last one, five, and fifteen minutes.

These, too, will show an extremely high figure when nothing more than low-priority processes are waiting for their turn to run.
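You can watch it happen with nothing fancier than uptime. The output below is illustrative, not from a real system, but it’s the sort of thing you’ll see on a box full of niced work:

uptime
# 09:14:02 up 41 days,  2:03,  3 users,  load average: 8.12, 7.98, 7.64
# a load average of eight on an eight-core box sounds alarming, right up until you
# notice that everything in the queue is running at nice 19; the moment a
# normal-priority process asks for time, it gets it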

Normal.

Also, “Priorities”. They work.

Nowadays, we have much the same challenge — working to change perspectives, helping people to unlearn what they’ve learned — but on containerized workloads.

“But teh CPU! MOARSERVERESZ!”

Wrong perspective. The right perspective is to ask yourself instead, “How’s my app’s performance?” Is it responsive to the requests it receives? Ask yourself, “Have I taken every conceivable step I can to improve its performance?” You have? All of them that you can? Are you sure? Have you also taken steps to fundamentally shrink its instruction set so it’s not carrying around all of that unneeded bloat? I’m not referring to code bloat — code bloat is a separate problem there, too — but something more foundational than code: have you reduced the CPU instruction set?
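Before any of that, though, the practical first step is simply to measure what the users feel instead of watching a CPU graph. A minimal sketch, assuming the app answers HTTP at a made-up endpoint:

# sample the total response time every five seconds; the URL is a placeholder
while :; do
  curl -s -o /dev/null -w '%{time_total}s\n' https://app.example.internal/healthz
  sleep 5
done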

The one with an overbuilt solution…

Then there was the one about the overbuilt solution.

We were a “Microsoft Shop”. Windows NT4 was in full-swing. Virtualization was in its infancy.

Leadership discouraged the exploration of “best of breed” solutions. Making things more useful by simplifying them wasn’t permitted, because the simpler alternative was, more often than not, interpreted as “not Windows.”

Windows was “the only solution”.

One of the applications we needed was something to translate the incompatible line endings of text files submitted by a particular customer into something that Windows could read.

I knew this would be as simple as a cron job to trigger a periodic unix2dos command. It’s a built-in command. Trivial for a low-priority utility box. And because it was coming from a customer’s *nix system, that command could even be injected quite harmlessly into their workflow so it happened before it was even sent to us.

“Impossible!” leadership would howl.

Rather than ensure it was done before transfer, we would do it after. We would also bear the cost of this particular server sitting entirely idle except for the two times per day (about half a second each) that it would have to do its assigned job. It couldn’t be tasked with any other process or job because, in those days, a server was assigned one task. This went far beyond company policy; it was ingrained in the very thinking — the cultural belief across IT fields.

So, we did it the enterprise way because “that’s the way it’s always been done.”

  • Select physical hardware, because virtualization was still such a “new” concept (several years old by that time) that it couldn’t be trusted.
  • Buy a new server, for about $3k, with suitable onboard RAID-1, dual NICs, dual power supplies, and dual CPUs. It’ll draw about 100 watts at idle. Always. It will occasionally run a bit higher than that, but it’s practically idle. All the time.
  • Wait about two months for the servers to arrive.
  • Buy a Windows license.
  • Buy a patching license.
  • Double it (again), because policy would require redundancy. They’d need to be installed in pairs, at minimum, per environment.
  • Policy would also require that equal hardware be deployed to the Staging and Prod environments. That results in a deployment of six servers, minimum.
  • Ensure we have physical space and power capacity in the datacenter to support those six servers — because policy — so plan for the maximum possible load of 230 watts each, times six.
  • And that’s beyond the unprovisioned $24k hardware cost.

Don’t forget to ensure it’s included in the security/patching list.

And update the inventory list. Because we’ll also need to dispose of it in five years’ time.

Y’know, spending all of that time making assorted “impossible” claims really irritates people who are already doing the impossible.

“Challenge accepted.”

So we did it the more efficient way. A simple unix2dos command on an already-existing, low-priority Linux utility VM. Yep, we managed to sneak one of those in. Two, actually. And it ran flawlessly for several years. And it was low priority. If something else needed resources, it would happily step entirely out of the way and wait.

As I recall, it was literally:

unix2dos -k ${filename}

Actually, it was stuck in crontab, so it was very slightly more complex.
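Something along these lines, with the schedule and drop directory being illustrative rather than the originals:

# twice a day, convert any newly dropped customer files in place (paths are made-up examples)
0 6,18 * * *  find /data/incoming/customer -name '*.txt' -exec unix2dos -k {} +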

Line-endings wouldn’t be an issue today, of course. Operating systems thankfully are graceful enough to ignore certain low-level encoding limitations.

Mostly.

Brutal Week, Appropriate Quote

We’ve had a pretty rough week, this one, at the office. A number of small issues, several still unexplained, have affected various aspects of our production systems — ultimately affecting our customers.

Another engineer was asked to describe one of the rather long outages for upper management. He sent only this:

“We experienced a stage one resumé-generating event.”

Lots of engineers were indeed busy updating their resumés, just in case they’d need them a few days later.