One Read/One Write – Isn’t the whole story…

Not many years ago (dial-up days… dark times) we taught “Click Save!” for all of your important data.

Our current technology has evolved to the point, thankfully, where Save (or, worse, Save/Apply/OK) simply isn’t necessary anymore.

Back in those dial-up days, whenever one of our customers hit finals week (and many of them did, around the same time each evening for two or three weeks), students would all sit their assorted three-hour exams at once.

Hundreds of colleges and universities. Each with thousands (or more) students. At around the same time each night during a two or three-week window. There could easily have been 200,000 or more taking their exams.

Consider the then mindset — we’d spent, by that time, decades teaching, “Click Save!”

You see, they (students, faculty, administration, developers) didn’t trust the database. The same database that housed every aspect of their identity, course list, the course content itself. Everything.

Sure, it had redundant power, network, CPU, disk — everything. Any conceivable hardware failure had redundancy.

But they didn’t trust this mysterious “database” thing.

They wanted — insisted on — a “just in case” solution.

Consider the introductory statement above. This, of course, led to, “What if we give them a Save button on the page?!”

Sure, it already had a Save button, which triggered a write to the DB and a refresh of the page. But it evolved: we also had it write their exam to a flat file.

Just in case.

It’s just one read, and one write, after all. It wouldn’t generate any extra load. Besides, “doing it right” would take too much work. Having something that saved with every click? Too much work.

Now, envision 200,000 students all clicking “Save” every 30 seconds or so during a three-hour window.

The DB handled it just fine. It barely broke a sweat.

Even though we then tasked it with something more than just “update the database”. So, when somebody clicked Save, we’d have it:

  1. write to the DB, then…
  2. connect to storage
  3. check the reference table to find the right folder
  4. check that folder’s file count
  5. wait while storage reported the number of objects
  6. create a new directory if the current one had too many objects
  7. update the reference table then
  8. write a plain-text copy of that user’s exam
  9. respond to the Save request with a page refresh
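
As a sketch of what each click amounted to (a hypothetical reconstruction in Python; the names and the FakeDB stand-in are made up, not the product's actual code):

```python
import os
import tempfile

MAX_OBJECTS = 1000  # hypothetical per-directory cap


class FakeDB:
    """Stand-in for the real database; all names here are hypothetical."""
    def __init__(self, default_folder):
        self.exams, self.folders = {}, {}
        self.default_folder = default_folder

    def update_exam(self, user, text):
        self.exams[user] = text

    def lookup_folder(self, user):
        return self.folders.get(user, self.default_folder)

    def update_folder_reference(self, user, folder):
        self.folders[user] = folder


def handle_save(db, user_id, exam_text):
    db.update_exam(user_id, exam_text)          # 1. write to the DB, then...
    folder = db.lookup_folder(user_id)          # 2-3. reference-table check
    count = len(os.listdir(folder))             # 4-5. an O(n) scan that grows with every file
    if count >= MAX_OBJECTS:                    # 6. roll to a new directory
        folder = folder + "_next"
        os.makedirs(folder, exist_ok=True)
        db.update_folder_reference(user_id, folder)  # 7. update the reference table
    # 8. write the plain-text "just in case" copy
    with open(os.path.join(folder, f"{user_id}.txt"), "w") as f:
        f.write(exam_text)
    return "refresh"                            # 9. page refresh


root = tempfile.mkdtemp()
db = FakeDB(root)
print(handle_save(db, "student42", "essay answer..."))  # -> refresh
```

Note that the directory scan in step 4-5 runs on every single click, which is the part that matters below.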

It’s just one read, one write. What could go wrong?

Now, do it all, 200,000 times. Every 30 seconds.

Oh, and as the number of files in that folder grew, it took longer… and longer… and longer just to see whether it needed to roll to a new directory. In fact, those storage devices ended up dropping offline because they were so busy reporting the directory’s object count.
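
For contrast, a count-free layout (a hypothetical sketch, not what the product did) derives the shard directory from the user ID alone, so no listing or counting is ever needed:

```python
import hashlib


def shard_for(user_id: str) -> str:
    """Derive the target directory from the user ID alone --
    no directory listing, no object count, O(1) per save."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # First four hex chars give 65,536 roughly evenly-filled shards.
    return f"exams/{digest[:2]}/{digest[2:4]}"


print(shard_for("student42"))
```

Each save then goes straight to its shard, and the load stays flat no matter how many exams accumulate.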

So, it was reported as “one read/one write” every 30 seconds, which sounded trivial enough. It became rather less trivial when multiplied by 200,000 students doing the same every 30 seconds, with each pass taking longer as the file count grew.

It turned out that the helpful “fix” was entirely self-inflicted. It began with the very premise that the DB wasn’t trusted, compounded and complicated by a few misunderstandings and misrepresentations of the nature of the data and how data moves around.

All because somebody didn’t trust a database and it was just “one read/one write”.

The database? It didn’t have any problems at all. Well, it did every now and again, but that’s not the point of this particular rant.

The time would have been more meaningfully spent educating customers about the reliability of these new-fangled “compuserves”, “interwebz”, and “databasing” things.

And to make it truly one read/one write, that Save button should have done nothing more complex than updating the database and refreshing the page (well, that, and perhaps reducing the risk of the student’s internet connection timing out).

Curious Web Traffic Coming from Germany

I started noticing these in my ingress logs recently:

2017/08/18 20:14:51 [error] 27277#27277: *174208140 open() "/usr/share/nginx/html/YesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScann" failed (36: File name too long), client: 137.226.113.11, server: , request: "GET 
/YesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreSca

They seem to have started 09 Aug. I’m getting them in short bursts – six requests in rapid succession. Then another burst a few hours later. Then a day or so later, several more.

Alas, our logging doesn’t capture the whole record, so I can’t see the user-agent as the message suggests. But I can see a client IP address: 137.226.113.11
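
For what it’s worth, nginx’s error-log format isn’t configurable, but the access log can be told to record the user-agent; a minimal log_format along these lines (the format name with_ua is made up, the directives are standard nginx) would capture it for future bursts:

```nginx
# Hypothetical format name; the directives themselves are standard nginx.
log_format with_ua '$remote_addr [$time_local] "$request" '
                   '$status "$http_user_agent"';
access_log /var/log/nginx/access.log with_ua;
```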

There’s nothing interesting straight away at that IP. It is hosting a web service, but its default doc is a 404 at the moment: http://137.226.113.11.

But what else is there? A trace shows it’s in Germany and likely resolves to something owned by rwth-aachen.de:

$ traceroute 137.226.113.11
    traceroute to 137.226.113.11 (137.226.113.11), 64 hops max, 52 byte packets
     1  rac1.esthermofet.net (192.168.1.1)  3.624 ms  0.777 ms  0.627 ms
     2  63.140.19.1.ifibertv.com (63.140.19.1)  2.557 ms  2.568 ms  3.746 ms
     3  208.84.220.193.ifibertv.com (208.84.220.193)  2.446 ms  2.576 ms  2.462 ms
     4  174.127.154.33 (174.127.154.33)  8.541 ms  8.791 ms  8.765 ms
     5  ae9.mpr1.sea1.us.above.net (208.185.155.61)  8.415 ms  8.326 ms  8.561 ms
     6  ae27.cs1.sea1.us.eth.zayo.com (64.125.29.0)  149.309 ms  141.520 ms  141.475 ms
     7  ae2.cs1.ord2.us.eth.zayo.com (64.125.29.27)  142.018 ms  153.854 ms  141.519 ms
     8  ae3.cs1.lga5.us.eth.zayo.com (64.125.29.208)  158.628 ms  141.480 ms  148.802 ms
     9  ae5.cs1.lhr11.uk.eth.zayo.com (64.125.29.127)  141.373 ms  141.690 ms  141.361 ms
    10  ae6.cs1.ams10.nl.eth.zayo.com (64.125.29.76)  141.582 ms  141.623 ms  144.583 ms
    11  ae0.cs1.ams17.nl.eth.zayo.com (64.125.29.81)  141.803 ms  141.407 ms  141.834 ms
    12  ae2.cs1.fra6.de.eth.zayo.com (64.125.29.58)  142.005 ms  141.638 ms  141.478 ms
    13  ae27.mpr1.fra3.de.zip.zayo.com (64.125.31.217)  141.756 ms  141.556 ms  141.549 ms
    14  ae8.mpr1.fra4.de.zip.zayo.com (64.125.26.234)  141.561 ms  141.473 ms  141.273 ms
    15  * * *
    16  kr-aah15-0.x-win.dfn.de (188.1.242.110)  177.355 ms  177.996 ms  177.197 ms
    17  fw-xwin-2-vl106.noc.rwth-aachen.de (134.130.3.230)  170.592 ms  171.043 ms  170.466 ms
    18  n7k-ww10-1-vl158.noc.rwth-aachen.de (134.130.3.243)  174.294 ms  174.368 ms  174.157 ms
    19  n7k-ww10-3-po1.noc.rwth-aachen.de (134.130.9.166)  174.397 ms  174.189 ms  175.918 ms
    20  c4k-i4-1.noc.rwth-aachen.de (137.226.35.67)  172.026 ms  171.764 ms  173.803 ms
    21  researchscan4.comsys.rwth-aachen.de (137.226.113.11)  173.819 ms  173.620 ms  173.683 ms

Okay, the penultimate stop also resolves to rwth-aachen.de, so I think a full port scan of that host’s segment might reveal some useful details, but let’s just scan this one host and see what we get:

# nmap 137.226.113.11

Starting Nmap 6.40 ( http://nmap.org ) at 2017-08-19 00:56 UTC
Nmap scan report for researchscan4.comsys.rwth-aachen.de (137.226.113.11)
Host is up (0.13s latency).
Not shown: 994 closed ports
PORT     STATE    SERVICE
80/tcp   open     http
135/tcp  filtered msrpc
139/tcp  filtered netbios-ssn
445/tcp  filtered microsoft-ds
1433/tcp filtered ms-sql-s
2323/tcp filtered 3d-nfsd

Nmap done: 1 IP address (1 host up) scanned in 23.45 seconds

An MS host, eh? With an MS SQL port visible (if filtered) from the internet? Looks like there might be some shenanigans going on there.

Let’s scan their whole segment. Maybe there’s something else with port 80 open that we can casually observe:

# nmap 137.226.113.0-255 -p80 --open

    Starting Nmap 6.40 ( http://nmap.org ) at 2017-08-19 01:03 UTC
    Nmap scan report for jodelforschung.comsys.rwth-aachen.de (137.226.113.6)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for researchscan1.comsys.rwth-aachen.de (137.226.113.8)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for researchscan2.comsys.rwth-aachen.de (137.226.113.9)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for researchscan3.comsys.rwth-aachen.de (137.226.113.10)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for researchscan4.comsys.rwth-aachen.de (137.226.113.11)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for researchscan5.comsys.rwth-aachen.de (137.226.113.12)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for researchscan6.comsys.rwth-aachen.de (137.226.113.13)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for researchscan7.comsys.rwth-aachen.de (137.226.113.14)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for researchscan8.comsys.rwth-aachen.de (137.226.113.15)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for researchscan9.comsys.rwth-aachen.de (137.226.113.16)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for researchscan10.comsys.rwth-aachen.de (137.226.113.17)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for researchscan11.comsys.rwth-aachen.de (137.226.113.18)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for researchscan12.comsys.rwth-aachen.de (137.226.113.19)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for vega.comsys.rwth-aachen.de (137.226.113.26)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for researchscan19.comsys.rwth-aachen.de (137.226.113.27)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap scan report for researchscan20.comsys.rwth-aachen.de (137.226.113.28)
    Host is up (0.13s latency).
    PORT   STATE SERVICE
    80/tcp open  http

    Nmap done: 256 IP addresses (243 hosts up) scanned in 10.17 seconds

Wow. Okay, let’s just ask those for their default pages and see what comes back.

nmap 137.226.113.0-255 -p80 -oG outfile
for i in `cat outfile | grep -i open | awk '{print $2}'`
do
    echo $i
    curl http://${i}
done

That produced loads of output, but it did come back with this:

Ah, a research project!

I’m afraid their reasoning is a bit vague. “…helps computer scientists study the deployment and configuration of network protocols and security technologies…”

You don’t have to be a computer scientist to understand that sending abnormally long requests to sites will result in an error.

For us, this isn’t causing any sort of problem – it’s really just a minor mystery I noticed in the logs while researching something else entirely.

For now, I’m going to leave our firewall unchanged. But I’m also going to log and chart their probes to see if the behavior changes over time.
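
Sketching that log-and-chart idea (a hypothetical helper; it assumes the default nginx error-log timestamp format, YYYY/MM/DD HH:MM:SS):

```python
import re
from collections import Counter

PROBE_IP = "137.226.113.11"


def probes_per_day(log_lines):
    """Count probe hits per day from nginx error-log lines
    mentioning the scanner's client IP."""
    days = Counter()
    for line in log_lines:
        if PROBE_IP in line:
            m = re.match(r"(\d{4}/\d{2}/\d{2})", line)
            if m:
                days[m.group(1)] += 1
    return days


sample = [
    '2017/08/18 20:14:51 [error] 27277#27277: ... client: 137.226.113.11, ...',
    '2017/08/18 20:14:52 [error] 27277#27277: ... client: 137.226.113.11, ...',
    '2017/08/19 03:02:10 [error] 27277#27277: ... client: 10.0.0.5, ...',
]
print(probes_per_day(sample))  # -> Counter({'2017/08/18': 2})
```

Run daily over the error log, that’s enough to chart whether the bursts are steady, growing, or tapering off.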

I’ll do my own research project.

Heisenberg Monitoring Uncertainty Principle

In certain implementations of software monitoring solutions, the type, quantity, and frequency of monitoring – the system or service checks – can result in an increase in load on the systems being tested. This increased load can lead to the flawed interpretation that additional monitoring tools are necessary to identify the load factors, resulting in further-increased load.

Or, to summarize: throw so much monitoring at a platform that it unexpectedly increases load, which prompts additional monitoring. Repeat.

Or, to summarize the summary: You cannot observe any system without impacting it.

Security Fail n+1… +1

One of the things that frustrates me is when a site – or worse, a group within my own organization – tells me that my password contains characters that aren’t allowed. Or that my password is too long.

Really? So what you’re saying is that you want me to trust your team’s developers with my security while they enforce a weaker standard than my own?

You need to change your password handling (hashes operate on bytes, after all) to accept Unicode strings of any reasonable length; and, yes, 256 characters of Unicode is a reasonable length for a password.
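
There’s no cryptographic excuse for the length limit: any modern password hash takes a byte string, and a 256-character Unicode passphrase is just a longer one. A minimal sketch (PBKDF2 parameters here are illustrative, not a tuning recommendation):

```python
import hashlib
import os


def hash_password(password, salt=None):
    """Hash an arbitrary-length Unicode password.
    Iteration count is illustrative, not a tuning recommendation."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, 200_000
    )
    return salt, digest


# A 256-character Unicode passphrase hashes to the same fixed-size
# digest as an 8-character ASCII one -- length and charset cost nothing.
long_pw = "pässwörd" * 32
salt, digest = hash_password(long_pw)
print(len(digest))  # -> 32 (SHA-256 output, regardless of input length)
```

The stored value is fixed-size either way, so rejecting long or non-ASCII passwords says more about the storage scheme than about the hash.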

Also, I just spotted a maddening double-shot of security bumbling with an organization that has integrated with Google Auth. The issue isn’t that they’ve integrated with Google Auth – that’s good – but it’s that they’ve disabled the ability to use two-factor authentication therein.

They’re improving usability by using single sign-on, but increasing the attack surface by disabling a proven security feature.

Oh, and they only allow ASCII for passwords. And not even all of them.